Abstract. We define a new kind of automata recognizing properties of data words or 
data trees and prove that the automata capture all queries definable in Regular XPath. 
We show that the automata-theoretic approach may be applied to answer decidability and 
expressibility questions for XPath. 



In this paper, we study data trees. In a data tree, each node carries a label from a finite 
alphabet and a data value from an infinite domain. We study properties of data trees, such 
as those defined in XPath, which refer to data values only by testing if two nodes carry the 
same data value. Therefore we define a data tree as a pair (t, ~) where t is a tree over a 
finite alphabet and ~ is an equivalence relation on nodes of t. Data values are identified 
with equivalence classes of ~. 

Recent years have seen a lot of interest in automata for data trees and the special 
case of data words. The general theme is that it is difficult to design an automaton which 
recognizes interesting properties and has decidable emptiness. 

Decidable emptiness is important in XML static analysis. A typical question of static 
analysis is the implication problem: given two properties $\varphi_1, \varphi_2$ of XML documents 
(modeled as data trees), decide if every document satisfying $\varphi_1$ must also satisfy $\varphi_2$. 
Solving the implication problem boils down to deciding emptiness of $\varphi_1 \land \neg\varphi_2$. 

A common logic for expressing properties is XPath. For XPath, satisfiability is 
undecidable in general, even for data words, see [1]. This means that most problems of static 
analysis are undecidable for XPath, e.g. the implication problem. Satisfiability is 
undecidable also for most other natural logics on data words or data trees, including first-order 
logic with predicates for order (or even just successor) and data equality. 

The approach chosen in prior work was to find automata on data words or trees that 
would have decidable emptiness and recognize interesting, but necessarily weak, logics or 
fragments of XPath. These weak logics include: fragments of XPath without recursion or 
negation [1, 11]; first-order logic with two variables [4, 5]; forward-only fragments related to 
alternating automata [8-10, 12]. The original automaton model for data words was [13]. 
See [14] for a survey. 

In this paper, we take a different approach. Any model that captures XPath will have 
undecidable emptiness. We are not discouraged by this, and try to capture XPath by 
something that feels like an "automaton". Three tangible goals are: 1. use the automaton 
to decide emptiness for interesting restrictions of data trees; 2. use the automaton to prove 
easily that the automaton (and consequently XPath) cannot express a property; 3. unify 
other automata models that have been suggested for data trees and words. 

What is our new model? To explain it, we use logic. From a logical point of view, a 
nondeterministic automaton is a formula of the form $\exists X_1 \cdots \exists X_n\ \varphi(X_1, \ldots, X_n)$, 
where the kernel $\varphi$ is relatively simple, e.g. it only talks about the relationship of labels in 
successive positions. As often in automata theory, when designing the automaton model, we try to 
use the prefix of existential set quantifiers as much as possible, in the interest of simplifying 
the kernel $\varphi$. For satisfiability, this is like a free lunch, since deciding satisfiability with or 
without the prefix is the same problem. 

In the automaton model that we propose in this paper, the kernel $\varphi$ is of the form "for 
every class X of ~, property $\varphi(X, X_1, \ldots, X_n)$ holds", where $\varphi$ is an MSO formula that 
can use predicates for navigation (sibling order, descendant), predicates for testing labels 
from the finite alphabet, but not the predicate ~ for data equality. The data ~ is only 
used in saying that X is a class. In the case of data words, this model is an extension of 
the data automata introduced in [5], which correspond to the special case when first-order 
quantifiers in $\varphi$ range only over positions from X. For instance, our new model, but not 
data automata, can express the property "between every two different positions in the same 
class there is at most one position outside the class with label a". 

The principal technical contribution of this paper is that the model above can recognize 
all unary queries of XPath. This proof is difficult, and takes over ten pages. We believe the 
real value of this paper lies in this proof, which demonstrates some powerful normalization 
techniques for formulas describing properties of data trees. Since the scope of applicability 
for these techniques will be clear only in the future, and since the appreciation of an 
"automaton model" may ultimately be a question of taste, we describe in more detail the 
three tangible goals mentioned above. 

1. The ultimate goal of this research is to find interesting classes of data trees which 
yield decidable emptiness for XPath. As a proof of concept, we define a simple subclass 
of data trees, called bipartite data trees, and prove that emptiness of our automata (and 
consequently of XPath) is decidable for bipartite data trees. This is only a preliminary 
result; we intend to find new subclasses in the future. 

2. We use the automaton to prove that XPath cannot define certain properties. Proving 
inexpressibility results for XPath is difficult, because the truth value of an XPath query in a 
position x might depend on the truth value of a subquery in a position y < x, which in turn 
might depend on the truth value of a subquery in a position z > y, and so on. On the other 
hand, our automaton works in one direction, so it is easier to understand its limitations. 
We use (an extension of) our automata to prove that for documents with two equivalence 
relations $\sim_1$ and $\sim_2$, some properties of two-variable first-order logic cannot be captured 
by XPath, which was an open question. 

3. We use the automaton to classify existing models for data words in a single framework. 
A problem with the research on data words and data trees is that the models are 
often incomparable in expressive power. In an upcoming paper¹ we will show that many 
existing models can be seen as syntactic fragments of our automaton. We hope that this 
classification will underline more clearly what the differences are between the models. 

2. Preliminaries 

Trees. Trees are unranked, finite, and labeled by a finite alphabet $\Sigma$. We use the terms 
child, parent, sibling, descendant, ancestor, node in the usual way. The siblings are ordered. 
We write $x \le y$ when x is an ancestor of y. Every nonempty set of nodes $x_1, \ldots, x_n$ in a 
tree has a greatest common ancestor (the greatest lower bound with respect to $\le$), which is 
denoted $\mathrm{gca}(x_1, \ldots, x_n)$. 

Let t and s be two trees, over alphabets $\Sigma$ and $\Gamma$, respectively, that have the same sets 
of nodes. We write $t \otimes s$ for the tree over the product alphabet $\Sigma \times \Gamma$ that has the same 
nodes as s and t, and where every node has the label from t on the first coordinate, and 
the label from s on the second coordinate. If X is a set of nodes in a tree t, we write $t \otimes X$ 
for the tree $t \otimes s$, where s is the tree over alphabet {0, 1}, whose nodes are the nodes of t 
and whose labeling is the characteristic function of X. 
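
As an illustration of the product notation, the following is a minimal Python sketch. It assumes that a labeling is stored as a dictionary from nodes to labels and that the shared tree shape is kept elsewhere; the names `product` and `mark` are hypothetical and do not come from the paper.

```python
def product(t_labels, s_labels):
    """Labeling of the product tree: each node gets the pair (label in t, label in s)."""
    if t_labels.keys() != s_labels.keys():
        raise ValueError("both trees must have the same set of nodes")
    return {node: (t_labels[node], s_labels[node]) for node in t_labels}


def mark(t_labels, X):
    """Labeling of t combined with X: the second coordinate is the characteristic function of X."""
    characteristic = {node: int(node in X) for node in t_labels}
    return product(t_labels, characteristic)


# Hypothetical three-node tree; node identifiers are arbitrary hashable values.
t = {"root": "a", "left": "b", "right": "a"}
marked = mark(t, {"left"})
```

Here `marked` plays the role of the tree written as t combined with the set X = {left} in the text.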

Regular tree languages and transducers. We use the standard notion of regular tree 
languages for unranked trees [7]. We also use transductions, which map trees to trees. Let $\Sigma$ 
be an input alphabet and $\Gamma$ an output alphabet. A regular tree language $f$ over the product 
alphabet $\Sigma \times \Gamma$ can be interpreted as a binary relation, which contains pairs $(s, t)$ such that 
$s \otimes t \in f$. We use the name letter-to-letter transducer for such a relation, underlining that 
the trees in a pair $(s, t) \in f$ must have the same nodes. In short, we simply say transducer. 
Observe that the transducer is nondeterministic. We often treat a transducer as a function 
that maps an input tree to a set of output trees, writing $t \in f(s)$ instead of $(s, t) \in f$. 

Data trees. A data tree is a tree t equipped with an equivalence relation ~ on its nodes 
that represents data equality. We use the name class for equivalence classes of ~. 

Queries. Fix an input alphabet. We use the name n-ary query for a function $\phi$ that maps 
a tree t over the input alphabet to a set $\phi(t)$ of n-tuples of its nodes. In this paper, we deal 
with queries of arities 0, 1, 2 and 3, which are called boolean, unary, binary and ternary. We 
also study queries that input a data tree $(t, \sim)$; they output a set of node tuples $\phi(t, \sim)$ as 
well. 



¹This paper is a journal version of a LICS 2010 paper [3], which included a rough description of the 
classification mentioned in item 3. However, a thorough explanation of the classification requires much 
space, and uses different techniques than the results in this paper. Therefore, we plan to present the 
classification in a separate paper. 

MSO. Logic is a convenient way of specifying queries, both for trees and data trees. We 
use monadic second-order logic (MSO). In a given tree, or a data tree, a formula of MSO is 
allowed to quantify over nodes of the tree using individual variables $x, y, z$, and also over sets 
of nodes using set variables $X, Y, Z$. A formula $\phi$ with free individual variables $x_1, \ldots, x_n$ 
defines an n-ary query, which selects in a tree t the set $\phi(t)$ of tuples $(x_1, \ldots, x_n)$ that make 
the formula true. To avoid confusion, we use round parentheses for the tree input of a 
query, $\phi(t)$, and square parentheses for indicating the free variables of a query. The two 
kinds of parentheses can appear together, e.g. $\phi[x_1, \ldots, x_n](t)$ will be the set of n-tuples 
selected in a tree t by a query with free variables $x_1, \ldots, x_n$. 

When working over trees without data, MSO formulas use binary predicates for the child 
and next-sibling relations (which allow one to define the descendant and following-sibling 
relations), as well as a unary predicate for each label. Queries defined by MSO with these 
predicates are called regular queries (of course, regular queries can also be characterized in 
terms of automata). When working over data trees, we additionally allow a binary predicate ~ to 
test data equality. A query using ~ is no longer called regular. For instance, the following 
unary query selects positions that have classes of size at least two: 

$\varphi(x) = \exists y\ \ x \neq y \land x \sim y.$ 
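
To make the query above concrete, here is a small Python sketch that evaluates it on a data word. It assumes the data word is given only by its list of data values (labels play no role in this query); the function name is hypothetical.

```python
def in_class_of_size_at_least_two(values):
    """Positions x for which some position y != x carries the same data value."""
    positions = range(len(values))
    return {
        x
        for x in positions
        if any(y != x and values[y] == values[x] for y in positions)
    }


# Data values 7, 3, 7, 5: only the two positions carrying 7 are selected.
selected = in_class_of_size_at_least_two([7, 3, 7, 5])
```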

Extended Regular XPath. We define a variant of XPath that works over data trees. For 
unary queries, the variant is an extension of XPath, thanks to including MSO as part of 
its syntax. We call the variant Extended Regular XPath. Unlike XPath, Extended Regular 
XPath allows for queries of arbitrary arity. Expressions of Extended Regular XPath are 
defined below. 

• Let $\Gamma = \{\phi_1, \ldots, \phi_n\}$ be a set of already defined unary queries of Extended Regular XPath, 
which will be treated as unary predicates. In the induction base, the set $\Gamma$ is necessarily 
empty. Suppose that $\varphi[x_1, \ldots, x_m]$ is an MSO query that uses unary predicates for queries 
from $\Gamma$, unary predicates for letters of the input alphabet, and the binary child and 
next-sibling predicates. Then $\varphi$ is an m-ary query of Extended Regular XPath. It is important 
that $\varphi$ does not introduce any new use of the data equality predicate ~; all appearances 
of ~ are reserved to the queries from $\Gamma$. 

• Suppose that $\varphi[x, y_1, y_2]$ is a ternary query of Extended Regular XPath. Then the 
following property of x is a unary query of Extended Regular XPath 

$\exists y_1 \exists y_2\ \ y_1 \sim y_2 \land \varphi[x, y_1, y_2].$    (2.1) 

Likewise for $y_1 \not\sim y_2$ instead of $y_1 \sim y_2$. 

The definition above allows for queries of any arity. In the paper, we will be principally 
interested in queries of arity one, and the queries of arity at most three used to build them. 
By abuse of nomenclature, we will write XPath instead of Extended Regular XPath. 

Binary trees. A binary tree is a tree where each node has at most two children. Although 
the interest of XPath is mainly for unranked trees, we assume in the proofs that trees are 
binary. This assumption can be made because XPath, as well as the models of automata 
introduced later on, are stable under the usual first-child / next-sibling encoding in the 
following sense. A language L of unranked data trees can be expressed by a boolean XPath 
query if and only if the set of binary encodings of trees from L can be expressed by a boolean 
XPath query. A similar, though more technical, statement holds for unary queries. 

Words will be considered as a special case of binary trees where each node has at most 
one child. 

3. Class automata 

In this section we define a new type of automaton for data trees, called a class automaton, 
and state the main result: class automata capture all queries definable in XPath. 

A class automaton is a type of automaton that recognizes properties of data trees. A 
class automaton is given by: an input alphabet $\Sigma$, a work alphabet $\Gamma$, a nondeterministic 
letter-to-letter tree transducer $f$ from the input alphabet $\Sigma$ to the work alphabet $\Gamma$, and a 
regular tree language on alphabet $\Gamma \times \{0, 1\}$, called the class condition. The class automaton 
accepts a data tree $(t, \sim)$ over input alphabet $\Sigma$ if there is some output $s \in f(t)$ such that 
for every class X, the class condition contains the tree $s \otimes X$. 

Example 1. Consider an input alphabet $\Sigma = \{a, b\}$. Let L be the data trees where some 
class contains at least three nodes with label a. This language is recognized by a class 
automaton. The work alphabet is $\Gamma = \{a, c\}$. The transducer guesses three nodes with 
label a, and outputs a on them; other nodes get c. The class condition consists of trees 
$s \otimes X$ over alphabet $\Gamma \times \{0, 1\}$ where X contains all or none of the nodes with label a. Note 
that the class condition does not inspect positions outside X. 
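
The acceptance condition and Example 1 can be sketched in Python for the special case of data words. This is only a brute-force illustration under stated assumptions: the transducer is modeled as a generator of all its outputs, the class condition as a predicate on words over the work alphabet paired with a bit, and all function names are hypothetical. It is not an efficient evaluation procedure.

```python
from itertools import combinations


def classes(values):
    """Group positions by data value; each group is one class."""
    groups = {}
    for position, value in enumerate(values):
        groups.setdefault(value, set()).add(position)
    return list(groups.values())


def accepts(transducer, class_condition, labels, values):
    """Brute-force acceptance: try every transducer output, test every class."""
    for output in transducer(labels):
        if all(
            class_condition([(output[i], int(i in X)) for i in range(len(output))])
            for X in classes(values)
        ):
            return True
    return False


def guess_three_a(labels):
    """Transducer of Example 1: keep a on three guessed a-positions, write c elsewhere."""
    a_positions = [i for i, letter in enumerate(labels) if letter == "a"]
    for chosen in combinations(a_positions, 3):
        yield ["a" if i in chosen else "c" for i in range(len(labels))]


def all_or_none(marked_word):
    """Class condition of Example 1: X contains all or none of the positions with work label a."""
    bits = [bit for letter, bit in marked_word if letter == "a"]
    return all(bits) or not any(bits)
```

For instance, `accepts(guess_three_a, all_or_none, "aabaa", [1, 1, 2, 1, 3])` holds because positions 0, 1 and 3 carry label a and share the data value 1.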

Example 2. Let K be the set of data words over $\Sigma = \{a, b\}$ where each class has exactly 
two positions $x < y$, and there is at most one a in the positions $\{x + 1, \ldots, y - 1\}$. In 
the class automaton recognizing K, the transducer is the identity function, and the class 
condition is 

$\Sigma_0^* \cdot \Sigma_1 \cdot b_0^* \cdot (a_0 + \epsilon) \cdot b_0^* \cdot \Sigma_1 \cdot \Sigma_0^*$ 

where $\Sigma_i$ is a shortcut for $\Sigma \times \{i\}$, likewise for $a_i$ and $b_i$. 
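
The class condition of Example 2 is an ordinary regular expression, so it can be tested with Python's `re` module. The sketch below assumes a hypothetical encoding in which every position becomes a two-character token (the letter followed by the bit); since the transducer is the identity, no guessing is needed.

```python
import re

# Sigma_0* . Sigma_1 . b_0* . (a_0 + epsilon) . b_0* . Sigma_1 . Sigma_0*
CLASS_CONDITION = re.compile(
    r"(?:[ab]0)*[ab]1(?:b0)*(?:a0)?(?:b0)*[ab]1(?:[ab]0)*"
)


def in_K(labels, values):
    """Check, for every class, that the word marked with that class satisfies the condition."""
    for value in set(values):
        encoded = "".join(
            letter + ("1" if v == value else "0")
            for letter, v in zip(labels, values)
        )
        if CLASS_CONDITION.fullmatch(encoded) is None:
            return False
    return True
```

The word with labels `abab` and data values 1, 2, 1, 2 is in K, while labels `baab` with values 1, 2, 2, 1 is not, because two positions labeled a lie strictly between the two positions of class 1.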

Comparison to data automata. Class automata are closely related to data automata 
introduced in [5]. Data automata were defined for data words. Since it is not clear what 
the correct tree version thereof is, we just present the version for data words. Like a 
class automaton, a data automaton has an input alphabet $\Sigma$, a work alphabet $\Gamma$, and a 
nondeterministic letter-to-letter transducer $f$ (this time only for words). The difference is 
in the class condition, which is less powerful in a data automaton. In a data automaton, the 
class condition is a word language over $\Gamma$, and not $\Gamma \times \{0, 1\}$. The data automaton accepts 
a data word $(w, \sim)$ if there is some output $v \in f(w)$ such that for every class X, the class 
condition contains the subsequence of v obtained by only keeping positions from X. In the 
realm of data words, data automata can be seen as a special case of class automata, where 
the class condition is only allowed to look at positions from the current class. The language 
L in Example 1 can be recognized by a data automaton (in the case of words), while the 
language K in Example 2 is a language that can be recognized by class automata, but not 
data automata. 

The difference between data automata and class automata is crucial for decidability 
of emptiness. Data automata have decidable emptiness [5], the proof being a reduction 
to reachability in Vector Addition Systems with States. Class automata have undecidable 
emptiness, because they capture the logic XPath, which has undecidable satisfiability. Also, 
a direct and simple proof of undecidable emptiness for class automata can be given, by 
encoding runs of two-counter machines, without going through the difficult reduction from 
XPath. 

Closure properties. Suppose that $f : \Sigma_1 \to \Sigma_2$ is any function. We extend $f$ to a function, 
also written $f$, from data trees over alphabet $\Sigma_1$ to data trees over alphabet $\Sigma_2$, by just 
changing the labels of nodes, and not the tree structure or data values. We use the name 
relabeling for any such function $f$. 

Lemma 1. Languages of data trees recognized by class automata are closed under union, 
intersection, images under relabelings, and inverse images under relabelings. 

Proof. The inverse images are the simplest: the letter-to-letter tree transducer in the class 
automaton is composed with the relabeling. For intersection, one uses Cartesian product. 
For union and images under relabelings, one uses nondeterminism. □ 

Evaluation. The evaluation problem (given an automaton and a data word/tree, check 
if the latter is accepted by the former) is NP-complete, even for a fixed data automaton 
(cf. [2]). Hence it is also NP-complete for class automata, which extend data automata. 

Class automata as a fragment of MSO. As mentioned in the introduction, one can see 
a class automaton as a restricted type of formula of monadic second-order logic. This is a 
formula of the form: 

$\exists X_1 \cdots \exists X_n\ \forall X\ \ \mathrm{class}(X) \Rightarrow \varphi(X_1, \ldots, X_n, X)$    (3.1) 

where $X_1, \ldots, X_n, X$ are variables for sets of nodes, the class formula is 

$\mathrm{class}(X) = \exists y\, \forall x\ \ (x \in X \iff y \sim x)$ 

and $\varphi$ is a formula of MSO that does not use ~. Formulas of the above form recognize exactly 
the same languages of data trees as class automata. For translating a class automaton to a 
formula, one uses the variables $X_1, \ldots, X_n$ to encode the output of the transducer, and the 
formula $\varphi$ to test two things: a) the variables $X_1, \ldots, X_n$ encode a legitimate output of the 
transducer; and b) the class condition holds for X. 

Main result. The main result of this paper is Theorem 1 below, which says that unary 
XPath queries over data trees can be recognized by class automata. To state the theorem, 
we need to say how a class automaton recognizes a unary query. We do this by encoding a 
unary query $\phi$ over data trees as a language of data trees: 

$L_\phi = \{(t \otimes X, \sim) : (t, \sim) \text{ is a data tree},\ X = \phi(t, \sim)\}.$ 

In other words, the language consists of data trees decorated with the set of nodes selected 
by the query. This encoding does not generalize to binary queries. 

Theorem 1. Every unary XPath query over data trees can be recognized by a class 
automaton. 

We begin the proof of Theorem 1, mainly to show where the difficulties appear. Then, 
we lay out the proof strategy in more detail. When referring to the language of a unary 
query, we mean the encoding above. 

We do an induction on the size of the unary query. The base case, when the query is a 
label a, is straightforward. Consider now the induction step, with a unary query 

$\phi[x] = \exists y_1 \exists y_2\ \ y_1 \sim y_2 \land \varphi[x, y_1, y_2]$ 

as in (2.1). (The same argument works for the case where $y_1 \not\sim y_2$.) Let $\phi_1, \ldots, \phi_n$ be all the 
unary XPath subqueries that appear in $\varphi$. By the induction assumption, the languages of 
the subqueries are recognized by class automata $\mathcal{A}_1, \ldots, \mathcal{A}_n$. Let the variables $X, X_1, \ldots, X_n$ 
denote sets of nodes. Consider the set L of data trees 

$(t \otimes X \otimes X_1 \otimes \cdots \otimes X_n, \sim)$ 

such that a) for each $i \in \{1, \ldots, n\}$, the data tree $(t \otimes X_i, \sim)$ is accepted by the automaton 
$\mathcal{A}_i$; and b) X is the set of nodes selected by the query $\phi'$ obtained from $\phi$ by replacing each 
subquery $\phi_i$ with "has 1 on coordinate corresponding to $X_i$". Suppose that the language 
of $\phi'$ is recognized by a class automaton. Then so is L, by closure of class automata under 
intersection and inverse images of projections, see Lemma 1. Finally, the language of $\phi$ is 
the image of L under the projection which removes the labels describing the sets $X_1, \ldots, X_n$. 

It remains to show that $\phi'$ is recognized by a class automaton (the advantage of $\phi'$ over 
$\phi$ is that it uses data equality ~ only once, to say that $y_1 \sim y_2$). A major part of this paper 
is devoted to this case, which is stated in the following proposition. 

Proposition 1. Class automata can recognize queries 

$\phi[x] = \exists y_1 \exists y_2\ \ y_1 \sim y_2 \land \varphi[x, y_1, y_2],$ 

where $\varphi$ is a regular ternary query (i.e. $\varphi$ does not use ~). Likewise for $y_1 \not\sim y_2$. 



Proof strategy. The construction of the automaton for $\phi[x]$ is spread across several 
sections. In Section 3.2, we introduce the main concepts underlying the proof. In particular, 
we define a new complexity measure for binary relations on tree domains, called guidance 
width, that seems to be of independent interest. In Section 3.3 we start the proof itself, 
formulate an induction, and reduce Proposition 1 to a more technical Theorem 2. Then in 
Section 4 we identify a simplified form of queries appearing in Theorem 2, and show how 
arbitrary queries can be transformed to the simplified form. Finally, Section 5 contains the 
proof of Theorem 2 for these simplified queries, the heart of the whole proof. 



3.1. Discussion of the proof. In this section, we discuss informally the concepts that 
appear in the proof of Theorem 1. For the purpose of illustration, we use words. 

We begin our discussion with words without data. For a regular binary query $\varphi[x, y]$, 
consider the unary query 

$\psi[x] = \exists y\ \varphi[x, y].$ 

We use the name witness function in a word w for a function which maps every position 
x satisfying $\psi$ to some y such that $\varphi[x, y]$ holds. Consider, as an example, the case where 
$\varphi[x, y]$ says that there exists exactly one z that has label a and satisfies $x < z < y$. The 
following picture shows a witness function. 

[Figure: a word whose positions are connected by arrowed lines depicting a witness function.] 

The way the picture is drawn is important. The witness function is recovered by 
following arrowed lines. The arrowed lines are colored black, dashed black, or gray, in such 
a way that no position is traversed by two arrowed lines of the same color. With the formula 
$\psi$ in the example, any input word has a witness function that can be drawn with three colors 
of arrowed lines. This can be generalized to arbitrary MSO binary queries; the number of 
colors depends only on the query, and not the input word. 

The above observation may be used to design a nondeterministic automaton recognizing 
a property like $\forall x\ \psi[x]$. The automaton would guess the labeling by arrows and then verify 
its correctness. The number of states in the automaton would grow with the number of 
colors; hence the need for a bound on the number of colors. Of course, there are other ways 
of recognizing $\forall x\ \psi[x]$, but we talk about the coloring since this is the technique that will 
work with data. 

We now move to data words. Consider a unary query 

$\psi[x] = \exists y_1 \exists y_2\ \ y_1 \sim y_2 \land \varphi[x, y_1, y_2],$ 

where $\varphi[x, y_1, y_2]$ says that $y_1 \le x \le y_2$, there is exactly one a label in the positions 
$\{y_1, \ldots, x - 1\}$ and there is exactly one b label in the positions $\{x, \ldots, y_2\}$. The query $\psi[x]$ 
is an example of a query as in Proposition 1. Consider the following data word (the labels 
are blank, a and b, the data values are 1 . . . 6). 

    label:  _  _  _  _  _  a  b  b  b  b  b 
    value:  1  2  3  4  5  6  5  4  3  2  1 

Say x is the first node with label b. This node is selected by $\psi$. Consider the pairs $(y_1, y_2)$ 
required by $\psi[x]$, which we call witnesses. The only possibility for $y_2$ is x itself; thus $y_1$ is 
also determined, as the only other position with the same data value. So there is only one 
witness pair. The same situation holds for all other positions with label b, which are the 
only positions selected by $\psi$. The drawing below shows how witness pairs are assigned to 
positions. 

[Figure: the same data word, with arrows connecting each position labeled b to its witness pair.] 

We would like to draw this picture with colored arrows, as we did for the first example 
of witness functions. If we insist on drawing arrows that connect each position x with its 
corresponding witness $y_1$, then we will need 5 colors as the middle position (labeled by a) 
is traversed by 5 arrows; the picture also generalizes to any number of colors. On the other 
hand, connecting each position x with its corresponding $y_2$ (a self-loop) requires only one 
color. We can symmetrically come up with instances of data words where connecting each 
node x to $y_2$ requires an unbounded number of colors. 
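
The color counts in this example can be checked with a short Python sketch. It assumes that every arrow is treated as its own guide and that, on a word, an arrow from x to y traverses exactly the positions between x and y. Under these assumptions the arrows form intervals, and the minimum number of colors equals the largest number of arrows passing through a single position. The function name is hypothetical.

```python
def colors_needed(arrows):
    """Minimum number of colors so that arrows sharing a position get different colors."""
    events = []
    for source, target in arrows:
        low, high = min(source, target), max(source, target)
        events.append((low, 1))        # the arrow starts covering positions at `low`
        events.append((high + 1, -1))  # and stops covering them after `high`
    overlap = best = 0
    # Sorting puts an end event before a start event at the same position,
    # so an arrow ending at p and one starting at p + 1 do not count as overlapping.
    for _, delta in sorted(events):
        overlap += delta
        best = max(best, overlap)
    return best


# The data word from the text: positions 6..10 carry label b.
values = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
b_positions = range(6, 11)
to_first_witness = [(x, values.index(values[x])) for x in b_positions]
to_second_witness = [(x, x) for x in b_positions]
```

With these definitions, `colors_needed(to_first_witness)` is 5 and `colors_needed(to_second_witness)` is 1, matching the discussion above.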

A consequence of our main technical result, Theorem 2, is that a bounded number of 
colors is sufficient if we want to perform the following task: for each position x selected by 
$\psi$, choose some witness pair $y_1 \sim y_2$, and connect x to either $y_1$ or $y_2$. The bound depends 
only on $\varphi$; in particular, the bound does not depend on the input data tree. 

The concepts of witness functions and coloring are defined more precisely below. 

3.2. The core result. We will state some technical results for a structure more general 
than a data tree, namely a graph tree. A graph tree is a tree t endowed with an arbitrary 
symmetric binary relation E over its nodes. A data tree is the special case of a graph tree 
where E is an equivalence relation. 

Witness functions. Let $\varphi[x, y_1, y_2]$ be a regular query (think of Proposition 1), and 
consider a graph tree $(t, E)$. We are interested in triples $(x, y_1, y_2)$ selected by $\varphi$ in t such that 
$(y_1, y_2) \in E$. (Think of E being either the data equivalence relation ~, or its complement.) 
Consider any such triple. The node x is called the source node; the notion of source node 
is relative to the query $\varphi$ and relation E, which will usually be clear from the context and 
not mentioned explicitly. The pair $(y_1, y_2)$ is called the witness pair, $y_1$ is called the first 
witness, and $y_2$ is called the second witness. These notions are all relative to a given x, but 
if we do not mention the x, then x is quantified existentially. Let X be a set of nodes in 
a graph tree (not necessarily containing all nodes). A witness function for $\varphi$ and X in a 
graph tree is a function which maps every node $x \in X$, treated as a source node, to some 
(first or second) witness. There may be many witness functions, since for each node we can 
choose to use either a first witness or a second witness, and there may be multiple witness 
pairs. 

The key technical result of this paper is that one can always find a witness function of 
low complexity. The notion of complexity is introduced below. 

Guidance width. A guide in a tree t is given by two nonempty sets of source nodes and 
target nodes. The support of the guide is the set of all nodes and edges on (the shortest) 
paths that connect some source node with a target node, including all the source and target 
nodes. A guide conflicts with another guide if their supports intersect. We write $\pi$ for 
guides. 

A guidance system is a set of guides $\Pi$. It induces a relation containing all pairs $(x, y)$ 
of tree nodes such that x is a source and y a target in some guide in $\Pi$. An n-color guidance 
system is a guidance system whose guides can be colored by n colors so that conflicting 
guides have different colors. The guidance width of a binary relation R on tree nodes is the 
smallest n such that some n-color guidance system induces R. 

In the proof we will only consider guidance systems for relations R that are partial 
functions from tree nodes to tree nodes. In such cases, it is sufficient to restrict to deterministic 
guides, i.e., those with precisely one target node. From now on, if not stated otherwise, a 
guidance system will be implicitly assumed to contain only deterministic guides. 
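
The definitions of support, conflict, induced relation and coloring can be sketched in Python as follows. The sketch assumes that a tree is given by a dictionary `parent` mapping every non-root node to its parent, that a guide is a pair (sources, targets) of sets, and that a coloring is a list with one color per guide; all names are hypothetical. Supports are represented by node sets only, which suffices for detecting conflicts because two paths that share an edge also share its endpoints.

```python
def path(parent, x, y):
    """Nodes on the shortest path from x to y in the tree described by `parent`."""
    above_x = [x]
    while above_x[-1] in parent:
        above_x.append(parent[above_x[-1]])
    ancestors_of_x = set(above_x)
    below_gca = []
    node = y
    while node not in ancestors_of_x:
        below_gca.append(node)
        node = parent[node]
    # `node` is now the greatest common ancestor of x and y.
    return above_x[: above_x.index(node) + 1] + below_gca[::-1]


def support(parent, guide):
    """All nodes on paths connecting some source of the guide with some target."""
    sources, targets = guide
    return {v for x in sources for y in targets for v in path(parent, x, y)}


def induced_relation(guides):
    """Pairs (x, y) such that x is a source and y a target of the same guide."""
    return {(x, y) for sources, targets in guides for x in sources for y in targets}


def is_valid_coloring(parent, guides, colors):
    """Check that conflicting guides (intersecting supports) have different colors."""
    supports = [support(parent, guide) for guide in guides]
    return all(
        colors[i] != colors[j]
        for i in range(len(guides))
        for j in range(i + 1, len(guides))
        if supports[i] & supports[j]
    )


# Hypothetical tree: root 0 with children 1 and 2; node 1 has children 3 and 4.
parent = {1: 0, 2: 0, 3: 1, 4: 1}
guides = [({3}, {4}), ({2}, {0}), ({4}, {2})]
```

In this example the first two guides have disjoint supports and may share a color, while the third guide conflicts with both, so two colors are enough and one is not.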

Witness functions of bounded width. We are now ready to state the main technical 
result, which forms the core of the proof of Theorem 1. 

Theorem 2. Let $\varphi$ be a regular ternary query. There exists a constant m, depending only 
on $\varphi$, such that in every graph tree, every set of source nodes has some witness function of 
guidance width at most m. 

In other words, regular ternary queries have witness functions of bounded guidance 
width. Before proving the theorem, we show how it implies Theorem 1. 

3.3. From Theorem 2 to Proposition 1. We show how Theorem 2 implies the last 
remaining piece of Theorem 1, namely Proposition 1. Consider a unary query $\phi[x]$ as in the 
statement of Proposition 1. We begin with the case when $\phi[x]$ is of the form 

$\phi[x] = \exists y_1 \exists y_2\ \ y_1 \sim y_2 \land \varphi[x, y_1, y_2].$ 

We need to find a class automaton that accepts the data trees $(t \otimes X, \sim)$ where X is the 
set of all nodes selected by $\phi$ in the data tree $(t, \sim)$. The class automaton will test the 
conjunction of two properties: 

Completeness. Each node selected by $\phi$ in $(t, \sim)$ is in X. 
Correctness. Each node in X is selected by $\phi$ in $(t, \sim)$. 

Recall that $\varphi$ is a regular query. We give separate class automata for the two properties. 
Completeness is simple. It can be rephrased as 

for every class Y and triple $(x, y_1, y_2)$ selected by $\varphi$, if $y_1, y_2 \in Y$ then $x \in X$. 

This is the type of property class automata are designed for: for every class, test a regular 
property. (Recall the discussion on class automata as a fragment of MSO.) Correctness is 
the difficult property, since the order of quantifiers is not the same as in a class automaton: 

for every $x \in X$ there is a class Y and $y_1, y_2 \in Y$ such that $(x, y_1, y_2)$ is 
selected by $\varphi$. 

Our solution is to use, as a part of the class automaton to be designed, a guidance system 
given by Theorem 2. 

Apply Theorem 2 to $\varphi$, yielding a constant m. The class automaton for the correctness 
property works as follows. Given an input data tree $(t \otimes X, \sim)$, it guesses an m-color 
guidance system; let R stand for the induced relation. The automaton then checks the two 
conditions below. 

A. For every $x \in X$ there is some y with $xRy$. 

B. For every class Y, if $xRy$, $x \in X$, $y \in Y$, then either $(x, y, y')$ or $(x, y', y)$ is in $\varphi(t)$, for 
some $y' \in Y$. 

If the class automaton accepts, then clearly every position in X is a source node. 
Conversely, if all nodes in X are source nodes, then there is an accepting run of the above class 
automaton. This accepting run uses the guidance system for the witness function from 
Theorem 2. 
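
Conditions A and B can be illustrated on data words with the following Python sketch. It is not the automaton construction itself: it assumes that the induced relation R has already been guessed and is given as a set of pairs, and that the regular query is available as a Python predicate. The example query is the one from Section 3.1, read with non-strict order constraints so that x may be its own second witness; all names are hypothetical.

```python
def check_conditions(X, R, values, phi):
    """Check conditions A and B for a set X, a relation R and a ternary predicate phi."""
    positions = range(len(values))
    # Condition A: every x in X is related by R to some position y.
    if not all(any((x, y) in R for y in positions) for x in X):
        return False
    # Condition B: if x R y with x in X, then y and some y' in the class of y
    # form a witness pair for x, in one of the two possible orders.
    for x, y in R:
        if x not in X:
            continue
        same_class = [z for z in positions if values[z] == values[y]]
        if not any(phi(x, y, z) or phi(x, z, y) for z in same_class):
            return False
    return True


# The data word from Section 3.1 ("_" stands for the blank label).
labels = "_____abbbbb"
values = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]


def phi(x, y1, y2):
    """Exactly one a in positions y1..x-1 and exactly one b in positions x..y2."""
    return (
        y1 <= x <= y2
        and labels[y1:x].count("a") == 1
        and labels[x : y2 + 1].count("b") == 1
    )


X = set(range(6, 11))          # the positions labeled b
R = {(x, x) for x in X}        # connect every x to its second witness, x itself
```

Here `check_conditions(X, R, values, phi)` holds, whereas the check fails for the set {0} with the relation {(0, 0)}, since position 0 is not a source node.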

This completes the proof for the case when $\phi[x]$ requires $y_1 \sim y_2$. For the case $y_1 \not\sim y_2$, 
the proof is almost the same, except for two changes. The first change is that we apply 
Theorem 2 to the graph trees $(t, E)$, obtained from data trees $(t, \sim)$ by taking as E the 
complement of ~. This explains why Theorem 2 is formulated for graph trees and not just 
data trees. The second change is that we write $y' \notin Y$ instead of $y' \in Y$ at the end of 
condition B. 

4. Simplifying the query 

Before proving Theorem 2, we formulate two simplifying conditions about the regular query 
$\varphi[x, y_1, y_2]$. 

For two nodes x, y in a tree t, we write $\mathrm{word}_t(x, y)$ for the sequence of labels on the 
unique shortest path from x to y in t, including x and y. We omit the subscript t when a 
tree is clear from the context. Note that $\mathrm{word}_t(x, y)$ is always nonempty and $\mathrm{word}_t(x, x)$ is 
the label of x. 

The two conditions about the query $\varphi[x, y_1, y_2]$ are: 

(1) All selected triples satisfy $y_1 \le x \le y_2$. 

(2) Whether or not a triple is selected depends only on the words $\mathrm{word}_t(y_1, x)$ and 
$\mathrm{word}_t(x, y_2)$. It does not depend on nodes outside the path from $y_1$ to $y_2$. 

The goal of this section is to reduce Theorem 2 to the case when $\varphi[x, y_1, y_2]$ is a 
simplified query, as defined above. This simplification is achieved in several steps. (In the 
case of words, the simplification would be standard, but for trees it requires new ideas 
about guidance systems.) Formally, in this section we show that Theorem 2 follows from 
Theorem 3 (deliberately formulated as late as in the forthcoming Subsection 4.7) that only 
speaks about simplified queries. Theorem 3 itself is proved in Section 5. 



4.1. Generalized witness functions. Fix any number $n \in \mathbb{N}$, although we will be mainly
interested in $n \in \{1, 2\}$. Consider a regular query $\varphi[x, y_1, \ldots, y_n]$ over trees.
Consider now a tree $t$ together with a set $E$ of $n$-tuples of nodes in $t$. As before, the idea
is that $E$ gives a constraint on the witness variables. A witness tuple for a node $x$ is a tuple
$(y_1, \ldots, y_n) \in E$ such that $(x,y_1, \ldots, y_n)$ is selected by $\varphi$ in $t$. In this
case, we say that $x$ is a source, and $y_i$ is an $i$-th witness for $x$ (the other variables are
quantified existentially).

A witness function for $\varphi$ and a set of source nodes $X$ in $(t, E)$ is a function which
assigns to each node $x \in X$ some witness (an $i$-th witness for some $i$, with $i$ depending
on $x$).

We say that a regular query $\varphi[x, y_1, \ldots, y_n]$ has witness functions of guidance
width $m$ if for every tree $t$, every choice $E$ of $n$-tuples of nodes of $t$ and every set $X$
of source nodes in $(t,E)$, there is a witness function for $\varphi$ and $X$ of guidance width at
most $m$. A query $\varphi$ has witness functions of bounded guidance width if some such $m$
exists.



4.2. Three arrangements. By an arrangement of the nodes $x, y_1, y_2$ in a tree we mean
the information on how these nodes, and their greatest common ancestors

$$\mathrm{gca}(x,y_1) \qquad \mathrm{gca}(x,y_2) \qquad \mathrm{gca}(y_1,y_2)$$

are related with respect to the descendant ordering. We distinguish three different
arrangements, pictured below.

[Figure: the three arrangements of the nodes $x, y_1, y_2$ and their greatest common ancestors.]

These arrangements correspond, respectively, to the following situations. 

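$$\mathrm{gca}(x,y_1) = \mathrm{gca}(x,y_2) \le \mathrm{gca}(y_1,y_2) \qquad \text{(A1)}$$
$$\mathrm{gca}(x,y_1) = \mathrm{gca}(y_1,y_2) \le \mathrm{gca}(x,y_2) \qquad \text{(A2)}$$
$$\mathrm{gca}(x,y_2) = \mathrm{gca}(y_1,y_2) \le \mathrm{gca}(x,y_1) \qquad \text{(A3)}$$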
The arrangements are not mutually exclusive: for instance, the case $x = y_1 = y_2$ is covered
by all three. The slightly more general case $y_1 = y_2$, which essentially represents binary
queries $\varphi[x,y]$, is fully covered by (A1).
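As an illustration, the arrangements of a triple can be computed directly from the three greatest common ancestors. The sketch below is only an assumption-laden example: the `parent` dictionary encoding and the function names are hypothetical, and the ancestor relation is taken to be reflexive, which is what makes the case $x = y_1 = y_2$ fall under all three arrangements.

```python
def ancestors(parent, node):
    """Return the chain node, parent(node), ..., root."""
    chain = [node]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain


def gca(parent, x, y):
    """Greatest common ancestor of x and y."""
    above_x = set(ancestors(parent, x))
    for node in ancestors(parent, y):
        if node in above_x:
            return node


def is_ancestor(parent, a, b):
    """Reflexive ancestor test: a lies on the path from b to the root."""
    return a in ancestors(parent, b)


def arrangements(parent, x, y1, y2):
    """Set of arrangement names, among A1, A2, A3, that the triple satisfies."""
    a, b, c = gca(parent, x, y1), gca(parent, x, y2), gca(parent, y1, y2)
    result = set()
    if a == b and is_ancestor(parent, a, c):
        result.add("A1")
    if a == c and is_ancestor(parent, a, b):
        result.add("A2")
    if b == c and is_ancestor(parent, b, a):
        result.add("A3")
    return result


# Hypothetical example tree: root 0 with children 1 and 2; nodes 3 and 4 are children of 1.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1}
print(arrangements(parent, 2, 3, 4))
```

In any tree, two of the three greatest common ancestors coincide and lie above the third, so the returned set is never empty.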

Lemma 2. We may assume without loss of generality that all the triples selected by $\varphi$, as
in the statement of Theorem 2, have the same arrangement.

Proof. Otherwise we can split $\varphi$ into a union of three queries, one for each arrangement,
and then combine the three separate guidance systems. □



4.3. Path-based queries. Let us fix one of the arrangements. There are four words
$w_1, w_2, w_3, w_4$ that will interest us. These are shown on the picture below for the
arrangement (A1) only, but the reader can easily see the situation for all other arrangements.

[Figure: arrangement (A1) with the four path words marked.]

$$w_1 = \mathrm{word}(\mathrm{gca}(x,y_1), x)$$
$$w_2 = \mathrm{word}(\mathrm{gca}(x,y_1), \mathrm{gca}(y_1,y_2))$$
$$w_3 = \mathrm{word}(\mathrm{gca}(y_1,y_2), y_1)$$
$$w_4 = \mathrm{word}(\mathrm{gca}(y_1,y_2), y_2)$$

A regular query $\varphi[x, y_1, y_2]$ is called path-based if its truth value depends only on
some regular properties of the four words $w_1, \ldots, w_4$. The precise definition of
path-based queries we use in this paper is in terms of monoids. A query that selects triples
only in arrangement (A1) is called path-based if there exists a monoid morphism

$$\alpha : \Sigma^* \to S$$

such that membership $(x, y_1, y_2) \in \varphi(t)$ depends only on the values assigned by
$\alpha$ to the words $w_1, \ldots, w_4$. In other words, there is a set of accepting quadruples
$F \subseteq S^4$ such that $(x, y_1, y_2)$ belongs to $\varphi(t)$ if and only if

$$(\alpha(w_1), \ldots, \alpha(w_4)) \in F.$$

An analogous definition of path-based queries is given for the other arrangements (A2)
and (A3).
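To make the monoid-based definition concrete, here is a small Python sketch under purely illustrative assumptions: the monoid is taken to be the two-element group of parities, the morphism counts occurrences of the letter `a` modulo 2, and the accepting set is chosen arbitrarily. None of these choices comes from the paper.

```python
def alpha(word):
    """Hypothetical morphism into the additive monoid {0, 1}: parity of the number of 'a'."""
    return word.count("a") % 2


def is_selected(path_words, accepting):
    """Decide selection from the images of the four path words w1, ..., w4."""
    return tuple(alpha(w) for w in path_words) in accepting


# Hypothetical accepting set F: every path word contains an even number of 'a'.
F = {(0, 0, 0, 0)}
print(is_selected(("aba", "b", "aa", "bb"), F))
```

Since `alpha` is a morphism, the image of a concatenation is the sum modulo 2 of the images, which is what allows the four words to be processed independently.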



Lemma 3. We may assume without loss of generality that $\varphi$, as in the statement of
Theorem 2, is path-based.

Proof. The key observation is that there is a functional letter-to-letter transducer $f$ and a
path-based query $\gamma$ such that $\varphi = \gamma \circ f$, i.e. for a tree $t$ the set
$\varphi(t)$ of tuples selected by $\varphi$ in $t$ is the same as the set of tuples selected by
$\gamma$ in $f(t)$. The observation can be proved using logical methods (the transducer
computes MSO theories) or using automata methods (the transducer computes state
transformations).

To prove the lemma, we need to show that if Theorem 2 is true for the path-based
queries $\gamma$, then it is also true for arbitrary ternary queries $\varphi$. But this is
straightforward, as $\varphi$ and $\gamma$ have the same witness functions in trees $t$ and $f(t)$,
respectively. □
4.4. Composing guidance systems. In the sequel we will compose guidance systems as
outlined in the lemma below. For two partial functions $f, g$ on the set of nodes of a tree,
by the composition $g \circ f$ we mean, somewhat non-standardly, the partial function with the
same domain as $f$, and defined as follows:

$$(g \circ f)(x) = \begin{cases} g(f(x)) & \text{if } g \text{ is defined on } f(x), \\ f(x) & \text{otherwise.} \end{cases}$$
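The non-standard composition can be sketched in Python by representing partial functions as dictionaries. This follows the reading suggested by the proof of Lemma 4 (a node is first guided by $f$ and then, if $g$ is defined on $f(x)$, guided further by $g$); the dictionary encoding and the name `compose` are assumptions of this sketch.

```python
def compose(f, g):
    """Composition with the same domain as f: follow g after f only where g is defined."""
    return {x: g.get(fx, fx) for x, fx in f.items()}


# Hypothetical partial functions on node identifiers.
f = {1: 2, 3: 4}
g = {2: 5}
print(compose(f, g))
```

Node 3 keeps its value under $f$ because $g$ is undefined on 4, which is exactly the case in which the composition falls back to $f$.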



Lemma 4. Let $f, g$ be partial functions on the set of nodes of a tree, of guidance width $m_1$
and $m_2$, respectively. Then their composition $g \circ f$ is of guidance width at most
$2 m_1 m_2$.

Proof. Fix a tree $t$ together with some $m_1$- and $m_2$-color guidance systems $\Pi_f$ and
$\Pi_g$, inducing $f$ and $g$, respectively. We will show existence of a $2 m_1 m_2$-color
guidance system for $g \circ f$.

As the first step, combine $\Pi_f$ and $\Pi_g$ as follows: a node $x$ is first guided by
$\Pi_f$, and then, if $g$ is defined on $f(x)$, guided by $\Pi_g$ to its final destination.
Formally, $\Pi$ contains those guides of $\Pi_f$ whose destination node is not in the domain of
$g$; and moreover a number of guides that are composed of at least two guides, to be described
now.

Fix a pair of colors $(k, l)$, where $k$ is a color used in $\Pi_f$ and $l$ is a color used in
$\Pi_g$. A composed guide, colored by the pair $(k, l)$, is derived from one $l$-colored guide
from $\Pi_g$, say $\pi$, and all those $k$-colored guides from $\Pi_f$ whose destination node is
a source node of $\pi$. The source nodes of the composed guide are all source nodes of all the
above-mentioned $k$-colored guides from $\Pi_f$. The target node of the composed guide is the
target node of $\pi$.

We will focus on the composed guides only. (A 'non-composed' guide in $\Pi$, say colored
$k$, may be safely considered as colored by $(k, l)$, for any $l$.)

The above coloring, using $m_1 m_2$ colors, is not satisfactory, as same-colored guides may
be in conflict. We will show how to resolve these conflicts by introducing an additional
distinguishing piece of data into the colors. Fix a color pair $(k, l)$ as above. Note that a
conflict may only arise when the $\Pi_g$-part ($l$-colored in $\Pi_g$) of one $(k, l)$-colored
guide, say $\pi_1$, conflicts with the $\Pi_f$-part ($k$-colored in $\Pi_f$) of another
same-colored guide, say $\pi_2$. Consider an undirected graph $G$, whose nodes are all
$(k, l)$-colored guides; there is an edge between $\pi_1$ and $\pi_2$ in the graph if the
above-mentioned conflict arises.

We claim that the graph $G$ is a forest, i.e., a disjoint union of trees. Towards a
contradiction, suppose that $G$ has a cycle consisting of $n$ pairwise different guides
$\pi_1, \ldots, \pi_n$. Take $\pi_{n+1} = \pi_1$. Let $x_1, \ldots, x_n$ denote arbitrarily chosen
nodes witnessing the conflicts, i.e., $x_i$ belongs to the supports of guides $\pi_i$ and
$\pi_{i+1}$. In $\pi_{i+1}$, for any $i \le n$, there is a unique path from $x_i$ to $x_{i+1}$
(take $x_{n+1}$ as $x_1$); denote it $p_i$. The path $p_i$ always uses a path of a guide from
$\Pi_f$, colored $k$, and a path of a guide from $\Pi_g$, colored $l$. As the $k$-colored guides
never conflict, and likewise the $l$-colored ones, the $k$-colored part of $p_i$ is separated
from the same-colored part of $p_{i+1}$ by at least one $l$-colored edge; thus the paths $p_i$
are nonempty, i.e., $x_i \neq x_{i+1}$. Assume that $x_1, \ldots, x_n$ are pairwise distinct (if
this is not the case, i.e., $x_i = x_j$, consider $x_i, \ldots, x_{j-1}$ instead; and consider
$\pi_i, \ldots, \pi_{j-1}$ instead of $\pi_1, \ldots, \pi_n$).

Now we are prepared to obtain a contradiction, thus proving that $G$ is a forest. If two
paths $p_i$ and $p_{i+1}$ share an edge adjacent to $x_{i+1}$, the edge may be removed from both
paths; this clearly forces $x_{i+1}$ to be replaced appropriately. Thus the paths can be made
edge-disjoint; moreover we keep the $x_i$ nodes pairwise distinct, argued as above. Hence the
paths $p_1, \ldots, p_n$ form a cycle in the tree $t$, a contradiction.

Knowing that $G$ is a forest, we may easily label its nodes by two numbers 1, 2, level
by level, starting from an arbitrary leaf in any connected component. This additional
numbering, added to the colors of the guides in $\Pi$, eliminates the problematic conflicts and
makes $\Pi$ a $2 m_1 m_2$-color guidance system as required. □
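The final step of the proof, labelling the nodes of a forest with the numbers 1 and 2 so that adjacent nodes differ, can be sketched with a breadth-first traversal. The adjacency-dictionary encoding and the function name are assumptions of this sketch, and the traversal starts from an arbitrary node of each component rather than specifically from a leaf.

```python
from collections import deque


def two_label_forest(adjacency):
    """Label each node of a forest with 1 or 2 so that neighbors get different labels."""
    label = {}
    for start in adjacency:
        if start in label:
            continue
        label[start] = 1
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in label:
                    # A forest has no cycles, so alternating level by level never clashes.
                    label[neighbor] = 3 - label[node]
                    queue.append(neighbor)
    return label


# Hypothetical conflict graph: a path a - b - c and an isolated guide d.
conflicts = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
print(two_label_forest(conflicts))
```

Each node of the conflict graph stands for one $(k, l)$-colored guide, so the label acts as the extra factor 2 in the bound $2 m_1 m_2$.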



4.5. Binary queries. 

Lemma 5. Every binary regular query $\varphi[x, y]$ has witness functions of bounded guidance
width.

Proof. Whenever a pair $(x, y)$ belongs to $\varphi(t)$, call the node $z = \mathrm{gca}(x, y)$ an
$x$-intermediate node, and call $y$ a $z$-final node. We will define two guidance systems, the
first one directing any source node $x$ to an $x$-intermediate one, and the second one directing
any intermediate node $z$ to a $z$-final one. The two guidance systems will be combined using
Lemma 4.

A binary query is essentially a degenerate case of ternary query, with $y_1 = y_2$. By
Lemma 3 assume that $\varphi$ is path-based. Thus its truth value in a tree $t$ only depends on
some regular properties of two words

$$w_1 = \mathrm{word}_t(x, \mathrm{gca}(x, y)) \quad \text{and} \quad w_2 = \mathrm{word}_t(\mathrm{gca}(x, y), y),$$

as depicted in the figure in Section 4.3. (Words $w_3$ and $w_4$ are empty as $y_1 = y_2$.)
Namely, $(x,y)$ belongs to $\varphi(t)$ if and only if $(\alpha(w_1), \alpha(w_2)) \in F$, for a
designated set $F \subseteq S^2$. Fix $(s_1, s_2) \in F$. We will define a guidance system for
pairs $(x,y)$ which satisfy

$$\alpha(w_1) = s_1, \qquad \alpha(w_2) = s_2.$$

(Then the required guidance system will be a disjoint union over all pairs $(s_1, s_2) \in F$.)

Assume further, without loss of generality, that x is in the left subtree, and y is in the 
right subtree of gca(x,y), including possibly y = gca(x,y). (Again, the required guidance 
system will be a disjoint union of two systems.) 

Consider deterministic word automata $\mathcal{A}_1$ and $\mathcal{A}_2$ that recognize the
properties $\alpha^{-1}(s_1)$ and $\alpha^{-1}(s_2)$, respectively. Think of a run of
$\mathcal{A}_1$, starting from a source node $x$, along the path in $t$ leading from $x$ to some
$x$-intermediate node. Consider such runs of $\mathcal{A}_1$ starting from all source nodes $x$,
one run from every source node. These runs may be translated into a guidance system, as
follows.

Each of the runs labels nodes on the path from $x$ to an $x$-intermediate node with states
of $\mathcal{A}_1$. The idea is that two source nodes $x$ and $x'$ may be directed to the same
intermediate node if the two runs of $\mathcal{A}_1$ that start in $x$ and $x'$ label some node of
$t$ with the same state. In other words, one may use the same guide both for $x$ and $x'$. Thus
there is a guidance system, with as many colors as the number of states of $\mathcal{A}_1$, that
follows the runs of $\mathcal{A}_1$ until acceptance, and directs any source node $x$ to some
$x$-intermediate node.

Likewise for $\mathcal{A}_2$, there is a corresponding guidance system that leads any
intermediate node $z$ to some $z$-final node. Applying Lemma 4 to these two guidance systems
we get the result. □
4.6. Arrangement (A1). In this section we show that Theorem 2 holds if all triples selected
by $\varphi[x, y_1, y_2]$ have arrangement (A1), pictured below.

Suppose $\tau[x,y]$ is a binary query, and $\sigma[y, y_1, y_2]$ is a ternary query. We define
the following ternary query

$$(\tau \circ_y \sigma)[x, y_1, y_2] = \exists y \; \tau[x,y] \wedge \sigma[y, y_1, y_2].$$

Lemma 6. Let $\tau, \sigma$ be as above. If $\tau$ and $\sigma$ have bounded width witness
functions then so does $\tau \circ_y \sigma$.

Proof. By considering the witness function for $\tau \circ_y \sigma$ obtained as a composition
and applying Lemma 4. □

Lemma 7. We may assume without loss of generality that $\varphi$ only selects triples
$(x, y_1, y_2)$ where $x = \mathrm{gca}(y_1, y_2)$.



Proof. By the considerations in Section 4.3 we know that a triple $(x, y_1, y_2)$ is selected by
$\varphi$ if and only if the images, under the morphism $\alpha$, of the four path words
$w_1, w_2, w_3, w_4$ belong to a designated set $F \subseteq S^4$ of accepting tuples.

Let $s_1, \ldots, s_4 \in S$. Let $\tau_{s_1,s_2}$ be the binary query that selects a pair $(x, y)$ if

$$\alpha(\mathrm{word}_t(\mathrm{gca}(x,y), x)) = s_1$$
$$\alpha(\mathrm{word}_t(\mathrm{gca}(x,y), y)) = s_2.$$

Likewise, let $\sigma_{s_3,s_4}$ be the ternary query that selects a triple $(y, y_1, y_2)$ if

$$\mathrm{gca}(y_1, y_2) = y$$
$$\alpha(\mathrm{word}_t(y, y_1)) = s_3$$
$$\alpha(\mathrm{word}_t(y, y_2)) = s_4.$$

The queries $\tau_{s_1,s_2}$ and $\sigma_{s_3,s_4}$ can be joined to define $\varphi$, in the
following way:

$$\varphi = \bigvee_{(s_1,s_2,s_3,s_4) \in F} \tau_{s_1,s_2} \circ_y \sigma_{s_3,s_4}.$$

By Lemma 6, we see that the width of witness functions for $\varphi$ is bounded by the widths of
the witness functions for the $\tau$ queries, which is bounded by Lemma 5, and the width of
the witness functions for the $\sigma$ queries. The latter are queries where the first variable
is the gca of the second and third variables, which concludes the proof of the lemma. □
Thanks to the above lemma, we are left with a query $\varphi$ that selects triples in the
arrangement pictured below (for future reference let us call this arrangement trivial).

We will provide a 2-color guidance system that induces a witness function for $\varphi$ in
$(t,E)$. This is guaranteed by the following lemma:

Lemma 8. Let $\varphi$ be any (not necessarily regular) query that selects only nodes in a
trivial arrangement. Then $\varphi$ has witness functions of guidance width 2.

Proof. The guidance system is constructed in a single root-to-leaf pass.

More formally, for each set $X$ of nodes that is closed under ancestors, we will provide a
guidance system $\Pi_X$ that directs each node that is a source node and in $X$ to some witness,
either $y_1$ or $y_2$. The guidance system will have the additional property that no tree edge
is traversed by two guides.

The guidance system is constructed by induction on the size of $X$. The induction base,
when $X$ has no nodes, is straightforward. We now show how $\Pi_X$ should be modified when
adding a single node $x$ to $X$. When $x$ is not a source node, then nothing needs to be done.
Otherwise, suppose that $x$ is a source node, and the witness is $(y_1, y_2)$. Since all guides
in $\Pi_X$ originate in nodes from $X$, any guide that passes through $x$ must also pass through
its parent. Using the additional assumption, we conclude that at most one guide $\pi$ from
$\Pi_X$ passes through $x$. In particular, either the left subtree of $x$, which contains $y_1$,
or the right subtree of $x$, which contains $y_2$, has no guide passing through it. We create a
new guide that connects $x$ to the witness in the subtree without a guide. □
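The root-to-leaf construction in this proof can be sketched as a greedy procedure. The sketch below is a simplified illustration, not the paper's construction verbatim: it assumes each source $x$ comes with one witness pair whose components lie in different subtrees of $x$, it tracks only which tree edges are already traversed, and it does not assign the two colors. The `parent` encoding and all names are hypothetical.

```python
def path_edges(parent, top, bottom):
    """Tree edges (parent, child) on the downward path from top to bottom."""
    edges = []
    node = bottom
    while node != top:
        edges.append((parent[node], node))
        node = parent[node]
    return edges[::-1]


def depth(parent, node):
    """Number of tree edges between node and the root."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def greedy_guides(parent, witnesses):
    """Process sources top-down; send each to a witness reachable over unused edges."""
    used = set()
    choice = {}
    for x in sorted(witnesses, key=lambda v: depth(parent, v)):
        for y in witnesses[x]:
            edges = path_edges(parent, x, y)
            if used.isdisjoint(edges):
                used.update(edges)
                choice[x] = y
                break
    return choice


# Hypothetical tree: root 0 with children 1, 2; node 1 has children 3, 4.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1}
# Source 0 has witness pair (3, 2); source 1 has witness pair (3, 4).
print(greedy_guides(parent, {0: (3, 2), 1: (3, 4)}))
```

In the example, the guide from the root to node 3 occupies the edge into node 3, so source 1 is sent to its other witness, node 4, mirroring the argument that at most one earlier guide passes through a newly added node.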



For arrangement (A1) the proof of Theorem 2 is thus completed.



4.7. Arrangements (A2) and (A3). For the remaining arrangements, in this section we
only show how they can be reduced to the simplified ones. We formulate Theorem 3
below, which we will use in this section, and which follows easily from the Main Lemma
(the forthcoming Lemma 9), to be proven in the next section.

To state the theorem, we need a new notion. A guidance system in a graph tree is
called consistent with respect to a given ternary query if each of its guides obeys the
following uniqueness requirement: whenever a set $Z \subseteq X$ of source nodes is guided to
the same node $y$, then there is a pair $(y_1, y_2)$ that is a witness pair for all nodes in $Z$,
with $y_1 = y$ or $y_2 = y$. Roughly speaking: if all nodes in $Z$ agree on the witness they are
guided to, then they agree on the other witness as well. The notion of consistency is
meaningful only relative to a given ternary query. Below, the consistency property will make
it possible to combine two guidance systems appropriately.

Theorem 3. Every simplified regular query has witness functions of bounded guidance 
width. Furthermore, a consistent guidance system always exists (of the required bounded 
guidance width). 



Now using Theorem 3 we prove Theorem 2 for arrangements (A2) and (A3). By symmetry,
we only consider the arrangement (A2). We simplify the arrangement in two steps.
First we claim that without loss of generality $x$ can be assumed to be an ancestor of $y_2$;
this may be shown by essentially the same technique as in Lemma 7, hence we omit the
details. Second, we show that $y_1$ can be assumed to be an ancestor of $x$ and $y_2$. The
arrangement (A2), as well as its two successive simplifications, are pictured below.

[Figure: arrangement (A2) and its two successive simplifications.]

Let our starting arrangement be the middle one in the picture above, i.e. we assume
that the first simplification has been already applied. Without loss of generality we may
assume that $y_1$ is in the left subtree, and $x$ in the right subtree of the
$\mathrm{gca}(x, y_1)$ node (thus we again split into two sub-cases), and that both $x$ and $y_1$
are not equal to $\mathrm{gca}(x, y_1)$.



By the considerations in Section 4.3, we know that a triple $(x, y_1, y_2)$ is selected by
$\varphi$ in $t$ if and only if the images, under the morphism $\alpha$, of the three path words

$$\mathrm{word}_t(x, \mathrm{gca}(x,y_1)), \qquad \mathrm{word}_t(x, y_2), \qquad \mathrm{word}_t(\mathrm{gca}(x, y_1), y_1),$$

belong to a designated set $F \subseteq S^3$ of accepting tuples.

Fix $(s_1, s_2, s_3) \in F$. We define a guidance system for triples $(x, y_1, y_2)$ which satisfy

$$s_1 = \alpha(\mathrm{word}_t(x, \mathrm{gca}(x,y_1))), \qquad s_2 = \alpha(\mathrm{word}_t(x, y_2)), \qquad s_3 = \alpha(\mathrm{word}_t(\mathrm{gca}(x, y_1), y_1)).$$

In general, the guidance system will be a disjoint union over all triples
$(s_1, s_2, s_3) \in F$. Consider an arbitrary graph tree $(t, E)$ over which $\varphi$ is
evaluated, together with an arbitrary subset $X$ of source nodes. We aim at constructing a
guidance system whose number of colors depends only on $\varphi$, that directs any source node
$x \in X$ to either $y_1$ or $y_2$, for some triple $(x, y_1, y_2)$ selected by $\varphi$. We will
do it in two stages. In the first stage, the node $x$ is directed either to $y_2$, or to
$y = \mathrm{gca}(x, y_1)$. By Theorem 3 we will be able to assume that this guidance system is
consistent. In the second stage, every $y$ node will be directed either to an appropriate $y_1$
node, or to the $y_2$ node, using Lemma 8. Finally, we will compose the two guidance systems
using Lemma 4.

Formally speaking, for the first stage we use the simplified query
$\sigma_{s_1,s_2}[x, y, y_2]$ that selects a triple $(x, y, y_2)$ if

• $\alpha(\mathrm{word}_t(x, y)) = s_1$

• $\alpha(\mathrm{word}_t(x, y_2)) = s_2$

• $y \le x \le y_2$

• $x$ is in the right subtree of $y$.

The idea now is that the query $\sigma_{s_1,s_2}$ is evaluated over a modified graph tree
$(t, E_{s_3})$. The relation $E_{s_3}$ is defined as follows: $(y, y_2)$ is in $E_{s_3}$ iff
$(y_1, y_2) \in E$ for some $y_1$ such that

• $y = \mathrm{gca}(y_1, y_2)$,

• $y_1$ is in the left subtree of $y$,

• $\alpha(\mathrm{word}_t(y, y_1)) = s_3$.

Intuitively, the first node $y_1$ of every edge $(y_1, y_2) \in E$ is moved to
$y = \mathrm{gca}(y_1, y_2)$, but only if the equation $\alpha(\mathrm{word}_t(y, y_1)) = s_3$
holds. Note that we use here the more general notion of graph trees, rather than data trees,
as the relation $E_{s_3}$ is not an equivalence in general. By Theorem 3 we know that
$\sigma_{s_1,s_2}$ has a witness function induced by a consistent $m$-color guidance system
$\Pi$, where $m$ only depends on $\sigma_{s_1,s_2}$ and does not depend on $t$, $E$ or $X$.

Each source node $x$ is directed so far either to some $y \le x$, or to $y_2 \ge x$. Without
loss of generality we may assume that colors used for target nodes of the first kind are
distinct from colors used for target nodes of the other kind. For the second stage, consider
the set $Y$ of all target nodes $y$ of $\Pi$ of the first kind, and restrict attention to the
induced sub-guidance system of $\Pi$. Note that every such node $y \in Y$ has an associated node
$y_1$ located in the left subtree of $y$ such that $\alpha(\mathrm{word}_t(y, y_1)) = s_3$.
Moreover, due to consistency of $\Pi$, $y$ has also an associated node $y_2$ located in the right
subtree of $y$ such that $\alpha(\mathrm{word}_t(y, y_2)) = s_1 s_2$. Consider the set of all such
triples $(y, y_1, y_2)$ and apply Lemma 8 to obtain a 2-color guidance system that directs every
$y \in Y$ to one of its two associated nodes.

Finally, using Lemma 4 we obtain a guidance system for $\varphi$ of bounded width.

As the graph tree $(t, E)$ and the subset $X$ were chosen arbitrarily, this completes the
proof of Theorem 2.

5. Proof of Theorem 3

Fix in this section a simplified regular query $\varphi[x, y_1, y_2]$, i.e., satisfying the
conditions (1) and (2) from Section 4. Since the query is regular, the dependency stated in
item (2) is a regular dependency. We may thus assume a semigroup morphism
$\alpha : \Sigma^* \to S$ recognizing $\varphi$, which maps each word to an element of a finite
semigroup $S$. Whether or not a triple $(x, y_1, y_2)$ is selected by $\varphi$ depends only on
the images

$$s_1 = \alpha(\mathrm{word}_t(y_1, x)) \in S \tag{5.1}$$

$$s_2 = \alpha(\mathrm{word}_t(x, y_2)) \in S. \tag{5.2}$$

In other words, there is a set of accepting pairs $F \subseteq S^2$ such that $\varphi(t)$ is the
set of triples $(x, y_1, y_2)$ with $(s_1, s_2) \in F$. We fix the morphism $\alpha$ for the rest of
this section.

We distinguish two types of edges in a graph tree $(t, E)$. The tree edges are edges that
connect parents with children, as well as a dummy edge going into the root of the tree and
dummy edges going out of the leaves. The class edges are the edges from $E$. We order tree
edges by the ancestor relation $\le$, according to the positions in the tree, with the dummy
edges coming as the least one and the maximal ones, respectively. For two tree edges
$e \le f$ in a tree $t$, we write $\mathrm{word}_t(e, f)$ for the word labeling $t$ on the path
that begins in the target of $e$ and ends in the source of $f$. In particular
$\mathrm{word}_t(e, e) = \varepsilon$.

Forward Ramseyan splits. The key tool in our proof is a forward Ramseyan split, as
defined by Colcombet in [6]. Let $t$ be a tree labeled with $\Sigma$. A split of height $n$ in
$t$ is a function $\sigma$ that maps each tree edge to a number in $\{1, \ldots, n\}$. We say
that two tree edges $e < f$ are neighbors with respect to a split $\sigma$, if $\sigma$ assigns
the same number to $e$ and $f$, and all tree edges between $e$ and $f$ are mapped to at most
$\sigma(e)$. A split $\sigma$ is called forward Ramseyan with respect to a morphism $\alpha$ if

$$\alpha(\mathrm{word}_t(e, f)) = \alpha(\mathrm{word}_t(e, g)) \tag{5.3}$$

holds for every three pairwise neighboring tree edges $e < f < g$. The following theorem was
shown in [6].

Theorem 4. Fix a morphism $\alpha : \Sigma^* \to S$. Every tree $t$ has a forward Ramseyan split
of height $O(|S|)$. Furthermore, the split is top-down deterministic in the sense that all tree
edges from a node to its children are assigned the same number in the split.
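Condition (5.3) can be checked mechanically on a single path of the tree. The following Python sketch is restricted to words, not trees, and rests on assumptions made only for illustration: edge $i$ sits before letter $i$, so the word between edges $e$ and $f$ is `letters[e:f]`, and the morphism is the parity of the number of occurrences of `a`.

```python
from itertools import combinations


def neighbors(split, i, j):
    """Edges i < j are neighbors: same value, and nothing larger in between."""
    return split[i] == split[j] and all(v <= split[i] for v in split[i:j + 1])


def is_forward_ramseyan(letters, split, alpha):
    """Check condition (5.3) for all triples of pairwise neighboring edges on a path."""
    assert len(split) == len(letters) + 1  # one more edge than letters on a path
    for e, f, g in combinations(range(len(split)), 3):
        pairwise = (neighbors(split, e, f) and neighbors(split, f, g)
                    and neighbors(split, e, g))
        if pairwise and alpha(letters[e:f]) != alpha(letters[e:g]):
            return False
    return True


def parity_of_a(word):
    """Hypothetical morphism into {0, 1} with addition modulo 2."""
    return word.count("a") % 2


print(is_forward_ramseyan("aaaa", [2, 1, 2, 1, 2], parity_of_a))
print(is_forward_ramseyan("aaaa", [1, 1, 1, 1, 1], parity_of_a))
```

In the first split only the three edges of value 2 are pairwise neighbors, and the words between them have even length; in the constant split, consecutive edges are neighbors and the words `a` and `aa` have different parities, so the condition fails.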

From now on wlog we consider only complete binary trees, where each non-leaf node 
has precisely two children. 

Factors. Two tree edges $e$ and $f$ that are comparable with respect to $\le$ (i.e., belonging
to one path) are called visible if all tree edges between $e$ and $f$ are mapped by the split to
values strictly smaller than $\sigma(e)$ and $\sigma(f)$. Visible pairs of tree edges naturally
determine a nested factorization of $t$ in the following way.

A pre-factor in a tree $t$ is a connected set of nodes (connected by tree edges) such
that if a node $x$ is in the pre-factor, then either all children of $x$ are in the pre-factor,
or none of them. Each pre-factor of $t$ has a root and some leaves (maximal nodes with respect
to $\le$), and inherits its edges from $t$. In the definitions below, we talk about tree edges
and not class edges. We distinguish internal edges of a pre-factor, connecting two nodes in
that pre-factor, and external edges connecting the root or the leaves with some node outside
the pre-factor. This includes the tree edge leading to the root of the pre-factor (called the
root edge of the pre-factor) and the tree edges going out of the leaves (called the leaf edges
of the pre-factor). Note that external edges may be either proper tree edges, or dummy
edges. As the split $\sigma$ is assumed to be deterministic, all tree edges leaving a given leaf
of a pre-factor are assigned the same number. A pre-factor $F$ is called a factor in $t$ if it
respects the split $\sigma$ in the following way: the root edge is visible from each of the leaf
edges. This means that on each (shortest) path in a factor from its root edge to a leaf edge,
numbers assigned by $\sigma$ to the internal edges on that path are strictly smaller than those
assigned to the two external edges. By the height of a factor we mean the greatest number
assigned to an internal edge, or 0 if no such edge exists (the case of a one-node pre-factor).
Additionally, the whole tree $t$ is also a factor if, without loss of generality, we assume that
the root dummy edge is visible from all leaf dummy edges; its height is at most the height of
$\sigma$.

A subfactor of a factor $F$ is any factor $G \subsetneq F$ that is maximal with respect to
inclusion. By the definition of factor, we get:

Claim 1. Every two different subfactors of $F$ are disjoint (have disjoint sets of nodes, but
possibly share an external edge).

Proof. Indeed, assume two non-disjoint different subfactors $F_1, F_2$ of some factor $F$. The
root node of one of them, say the root node $r_1$ of $F_1$, is necessarily contained in the
other. As $F_1$ is not included in $F_2$, there must be a leaf node $l_1$ of $F_1$ not contained
in $F_2$; the path that leads to that leaf node passes through a leaf node $l_2$ of $F_2$. If we
denote by $r_2$ the root node of $F_2$ we know that the four nodes are located on one path in
the following order:

$$r_2 \le r_1 \le l_2 \le l_1.$$

This arrangement of the nodes is in a clear contradiction with the assumption that the tree
edge incoming to $r_i$ is visible from (all) tree edges outgoing from $l_i$, for $i = 1, 2$. □

Hence each factor F is the disjoint union of its subfactors. We say a subfactor G is an 
ancestor of a subfactor H if their roots are so related. Likewise we talk about a subfactor 
being a child or parent of some other subfactor. 

A factor of height 2 together with its decomposition into subfactors is pictured below. 
[Figure: a factor of height 2 with its decomposition into subfactors; tree edges are annotated with their split values.]

Our proof of Theorem 3 is based on the Main Lemma stated below. The lemma is
proved by induction on the height of factors. To state the lemma, recall the notion of
consistent guidance system introduced in Section 4.7.

Lemma 9 (Main Lemma). Fix a factor height $h$. There is a bound $n \in \mathbb{N}$, depending only
on $\varphi$ and $h$, such that for every graph tree $(t, E)$, every factor $F$ in $t$ of height
$h$, and every set $X \subseteq F$ of source nodes, there is a witness function for $\varphi$ and
$X$ in $(t, E)$ induced by a consistent guidance system using at most $n$ colors. Furthermore,
this witness function only points to descendants of the root of $F$.

The proof of the lemma is by induction on the height h. The number of colors n will 
depend on h and the size of the monoid S recognizing the query. It will not depend on t. 
When going from height h to height h + 1 , there will be a quadratic blowup in the number 
of colors. Therefore, n will be doubly exponential in the height of F. 
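The step from a quadratic blowup per level to a doubly exponential bound can be checked numerically. In the sketch below the recurrence $n_{h+1} = c \cdot n_h^2$ and the constant $c$ are assumptions chosen for illustration; the text above only states that the blowup is quadratic. Under this assumption the closed form is $n_h = c^{2^h - 1} n_0^{2^h}$.

```python
def colors_after(height, n0, c):
    """Iterate the hypothetical recurrence n -> c * n * n, starting from n0."""
    n = n0
    for _ in range(height):
        n = c * n * n
    return n


def closed_form(height, n0, c):
    """Closed form of the recurrence: doubly exponential in the height."""
    return c ** (2 ** height - 1) * n0 ** (2 ** height)


# Starting from one color (as in the base case) with a hypothetical constant c = 4.
for h in range(4):
    print(h, colors_after(h, 1, 4), closed_form(h, 1, 4))
```

The exponent $2^h$ is what makes the bound doubly exponential in the height $h$, regardless of the particular constant.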

Since the witness function will be induced by a guidance system, the last assumption
in Lemma 9 could be restated as saying that no guide passes through the root edge of $F$.
Theorem 3 is a special case of the Main Lemma when $F$ is the whole tree.

The base case when h = 0, and hence the factor F has one or no nodes, is easy (1 
color, going downwards, is sufficient). For the induction step, fix a factor F, and assume 
that there is a bound n sufficient for any factor of smaller height than F, which includes 
all subfactors of F. Below, subfactors of F are simply called subfactors, without explicitly 
referring to F. 

A tree edge of $F$ that is an external edge of one of its subfactors is called a border edge.
In particular each external edge of $F$ is a border edge. Special care will be paid in our
proof to internal (i.e. not external) border edges, i.e., the edges that connect one subfactor
to another.

Claim 2. If two internal border edges in a factor are comparable by the ancestor relation
$\le$ then they are assigned the same value by the split.

Proof. Assume two internal border edges $e < f$, with different values assigned by the split,
such that no other internal border edge is located on the (shortest) path from $e$ to $f$. As
values assigned to all external border edges are strictly larger, one of $e, f$ is visible from
some (possibly external) border edge "over" the other. That is, either $e$ is visible from some
$e' > f$, or $f$ is visible from some $f' < e$. In both cases, one of $e, f$ is an internal edge
of some subfactor, a contradiction. □
We do a case distinction, regarding the number of internal border edges on the paths
from a source to witness nodes. For a node $x \in X$ and a witness $(y_1, y_2)$ we define two
numbers $m_1, m_2$. Let $m_1$ be the number of internal border edges on the path between $y_1$
and $x$, and let $m_2$ be the number of internal border edges on the path between $x$ and $y_2$.
For technical convenience, we deliberately choose not to count external border edges. We
divide the set $X$ into three parts:

$X_1$: Nodes $x \in X$ that have a witness with $m_2 \le 1$.

$X_2$: Nodes $x \in X$ that have a witness with $m_1 \le 1$ and $m_2 \ge 2$.

$X_3$: Nodes $x \in X$ that have a witness with $m_1, m_2 \ge 2$.

We prove the Main Lemma for each of the three parts separately. Next, we combine the
three guidance systems into a single guidance system. Our construction will yield two
kinds of guides: the ancestor guides pointing to the first witness and thus going up the
tree; and descendant guides pointing to the second witness, and thus going down the tree.
Interestingly, ancestor guides will be only created in case of nodes from $X_2$. All the guides
will satisfy the consistency condition required in Lemma 9.

Nodes from $X_1$, i.e. nodes that have a witness with $m_2 \le 1$. Consider a subfactor
$G$ of $F$. In this case, each node $x \in X_1 \cap G$ has a descendant witness $y_2$ that is
either in $G$, or in a child subfactor of $G$, or perhaps outside $F$. Apply the induction
assumption to $G$, producing a guidance system $\Pi_G$ with at most $n$ colors. Since the Main
Lemma requires the guidance system to point to descendants of the factor's root, and
$m_2 \le 1$, we infer that inside $F$ the guides of $\Pi_G$ can only intersect $G$ and its child
subfactors, and no other subfactors (it is possible that the guides leave the factor $F$,
though). Therefore, all the guidance systems $\Pi_G$, for all subfactors $G$ of $F$, can be
combined into a single guidance system with at most $2n$ colors, used alternately for even and
odd depths.

Nodes from $X_2$, i.e. nodes that have a witness with $m_1 \le 1$ and $m_2 \ge 2$. In this
case, for each node $x \in X_2$ there is an ancestor witness $y_1$ that is either in the
subfactor of $x$, in the parent subfactor, or outside $F$. Note that the latter is possible only
when $x$ belongs either to the root subfactor of $F$, or to some of its child subfactors; denote
this set of subfactors by $\mathcal{G}_0$. We will construct the guidance system in a
step-by-step manner, for all subfactors, according to the ancestor ordering.

Formally speaking, consider a family $\mathcal{G}$ of subfactors that is closed under ancestors
and includes $\mathcal{G}_0$. We provide a guidance system $\Pi_{\mathcal{G}}$ of $4n^2 + 3n$
colors that provides witnesses for all nodes of $X$ belonging to the subfactors in
$\mathcal{G}$. The construction of $\Pi_{\mathcal{G}}$ is by induction on the number of
subfactors in $\mathcal{G}$.

The induction base is when $\mathcal{G}$ equals $\mathcal{G}_0$. For each subfactor
$G \in \mathcal{G}_0$, we apply the Main Lemma, for the smaller height, to $G$ and nodes from
$X_2$ that belong to $G$, yielding an $n$-color guidance system. We combine these guidance
systems into $\Pi_{\mathcal{G}}$ as follows: use one set of $n$ colors for the root subfactor,
and another set of $n$ colors for all the child subfactors of the root.

For the induction step, suppose that we have already constructed $\Pi_{\mathcal{G}}$ for
$\mathcal{G}$, and that $G \notin \mathcal{G}$ is a subfactor whose parent is in $\mathcal{G}$.
Consider the guides of $\Pi_{\mathcal{G}}$ that pass through the root edge of $G$. We apply two
distinctions to these guides. First, we use the name parent guides for the guides that
originate in the parent subfactor of $G$, and the name far guides for the other guides. Second,
we use the name ending guides for the guides whose target is in $G$ and the name transit guides
for the other guides, which continue into a child subfactor of $G$, or even exit $F$.
Altogether, there are four possibilities: parent transit guides, far ending guides, etc. We
assume additionally that there are at most $n$ parent guides and at most $2n$ far and parent
guides altogether, and hence at most $2n$ guides entering $G$. This additional invariant is
satisfied by the induction base, and it will be preserved through the construction.

Apply the induction assumption of the Main Lemma to G and nodes from X_2 that 
belong to G, yielding a guidance system Π with n fresh colors. We use the name starting 
guides for the guides of Π. We want to combine Π_𝒢 with Π in such a way that the resulting 
guidance system still uses at most 4n² + 3n colors, like Π_𝒢, and satisfies the additional 
invariant. If we were to simply take the two systems together, we might end up with a leaf 
edge of G which is traversed both by starting and transit guides, which could exceed the 
bound 2n on guides passing through border edges. 

We solve this problem as follows. Consider the leaf edges of G that are traversed by the 
far transit guides. There are at most 2n such edges by our invariant assumption. We will 
remove all starting guides that pass through any of these edges, and find other witnesses for 
the nodes that use these starting guides. This guarantees that the invariant condition is 
recovered: at most 2n guides pass through any leaf edge of G, and at most n of them are 
starting guides. These other witnesses will be ancestors. This explains why the induction 
starts with 𝒢 containing the root subfactor and its children, since these are the subfactors 
that may have ancestor witnesses outside the whole factor F (recall that passing through the 
root edge is not counted in m_1). The statement of the Main Lemma does not allow guides 
that pass through the root edge of F. 

The removal of starting guides proceeds as follows. Let e be a leaf edge of G that is 
traversed by a far transit guide, which has color j in the guidance system Π_𝒢. Let π be a 
starting guide, which has color i in Π, that also traverses e, with y_2 its target node. By 
the consistency property of π, there is some y_1 such that (y_1, y_2) ∈ E is a witness pair for 
all source nodes of π. Note that by the assumption m_1 ≤ 1, the node y_1 is either in the 
subfactor G or its parent. We create an ancestor guide with a fresh color that connects all 
the source nodes of π to y_1. The color of this guide, which we call an ancestor color, will 
take into account three parameters: the colors i and j, as well as a parity bit b ∈ {0, 1}. 
The parity bit is 0 if and only if G has an even number of ancestor subfactors. We use the 
triple (i, j, b) for the color name. 

We will show that this new ancestor guide does not conflict with any other ancestor 
guide with the same color. Each new ancestor guide is contained in G and possibly its 
parent subfactor H, by the assumption m_1 ≤ 1. Inside the subfactor G there is at most one 
ancestor guide of each color, so there are no collisions inside G. One could imagine, though, 
that a new ancestor guide π with color (i, j, b) collides inside the parent subfactor H with 
some other ancestor guide π' of the same color. Since the colors of π and π' agree on the 
parity bit b, we conclude that the guide π' cannot originate in H, which has a different 
parity than G. Therefore, π' must originate in some other child subfactor of H, call it G', 
that had been previously added to 𝒢. Since the color of π' is also (i, j, b), we conclude that 
j was the color of a far transit guide in G'. This is impossible, since a far transit guide in 
G' or G must originate not in H, but in an ancestor of H, and therefore there would be a 
collision in the root of H. 

The ancestor guides created above are the only ancestor guides in our solution. In the 
subfactors where they are used, the ancestor guides have their target in the parent subfactor. 

Let us count the number of colors used. We need 2n colors for the transit guides, and 
n colors for the starting guides. For the ancestor guides, we need 4n² colors. Altogether, 
we need 4n² + 3n colors. Note that all the guides satisfy the consistency condition required 
by Lemma 9. 
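The color count above can be checked mechanically. The following is a minimal sketch, assuming that in a triple (i, j, b) the index i ranges over the n starting colors, j over the 2n transit colors, and b over the two parity bits; the function names are hypothetical and only illustrate the arithmetic.

```python
from itertools import product

def ancestor_colors(n):
    """Enumerate the ancestor colors (i, j, b) for a given n.

    Assumption: i ranges over the n colors of the starting guides,
    j over the 2n colors of the transit guides, b over the parity bits.
    """
    starting = range(n)
    transit = range(2 * n)
    parity = (0, 1)
    return list(product(starting, transit, parity))

def total_colors(n):
    """Transit colors + starting colors + ancestor colors."""
    return 2 * n + n + len(ancestor_colors(n))

for n in (1, 2, 5):
    assert len(ancestor_colors(n)) == 4 * n * n
    assert total_colors(n) == 4 * n * n + 3 * n
```

The enumeration makes explicit that the quadratic term comes only from pairing a starting color with a transit color.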

Nodes from X_3, i.e. nodes that have a witness with m_1, m_2 ≥ 2. In this case, each 
node x ∈ X_3 has a witness (y_1, y_2) such that the path from y_1 to x, as well as the path from 
x to y_2, passes through at least two internal border edges. This case is the only one where 
we use the forward Ramseyan split. 

Consider a source x with a witness (y_1, y_2). The internal border edges naturally split 
word_t(y_1, x) and word_t(x, y_2) into m_1 + 1 and m_2 + 1 words, respectively: 

\[ \mathrm{word}_t(y_1, x) = v_0 \cdot v_1 \cdots v_{m_1} \]

\[ \mathrm{word}_t(x, y_2) = w_0 \cdot w_1 \cdots w_{m_2}. \]

The first letter of v_0 is the label of y_1. The last letter of v_{m_1} and also the first letter of 
w_0 is the label of x. The last letter of w_{m_2} is the label of y_2 (recall that y_1 or y_2 might be 
outside F). Furthermore, each two consecutive internal border edges are not only visible, 
but also neighbouring, by Claim 2. Hence, as we have a forward Ramseyan split (cf. (5.3)), 
the values α(word_t(y_1, x)) and α(word_t(x, y_2)) are determined by the first two parts and 
the last part: 

\[
\begin{aligned}
\text{(i)}\quad & \alpha(\mathrm{word}_t(y_1, x)) = \alpha(v_0) \cdot \alpha(v_1) \cdot \alpha(v_{m_1}) \\
\text{(ii)}\quad & \alpha(\mathrm{word}_t(x, y_2)) = \alpha(w_0) \cdot \alpha(w_1) \cdot \alpha(w_{m_2})
\end{aligned}
\tag{5.4}
\]

Let us fix six values s_1, …, s_6 ∈ S. By splitting the set X_3 into at most |S|^6 parts, each 
requiring a separate guidance system, we can assume that each x ∈ X_3 has a witness where 

\[
s_1 = \alpha(v_0), \quad s_2 = \alpha(v_1), \quad s_3 = \alpha(v_{m_1}), \quad
s_4 = \alpha(w_0), \quad s_5 = \alpha(w_1), \quad s_6 = \alpha(w_{m_2}).
\]

We will only consider witnesses that satisfy the assumptions above. 

We now proceed to create the guidance system. As in the case m_1 ≤ 1, the guidance 
system will be defined for a family 𝒢 of subfactors that is closed under ancestors. The 
guidance system will use at most 3 colors, and will have the following additional invariant 
property: if e is an edge that connects a subfactor G with a child subfactor H, then at most 
two guides pass through e. Furthermore, if exactly two guides pass through e, then one of 
the guides has its target in H. 

The construction is by induction on the number of subfactors in 𝒢. The induction base, 
when 𝒢 has no subfactors, is obvious. Below we show how to modify a guidance system Π_𝒢 
for 𝒢 when adding a new subfactor G. 

Consider the (at most two) guides of Π_𝒢 that pass through the edge connecting G 
to 𝒢. As in the case m_1 ≤ 1, we use the term transit guide for the guides of Π_𝒢 that 
enter G through its root and exit through one of its external leaf edges. By the invariant 
assumption, there is at most one transit guide. 

We now define a guidance system Π for the nodes in G, which we will next combine 
with Π_𝒢. 

Claim 3. There is a one-color guidance system Π defining a witness function for all nodes 
in G ∩ X_3. 

Proof. For each node x ∈ G ∩ X_3, choose the lexicographically first witness y_2 > x that 
satisfies the assumptions on the six images in the semigroup, and call it y_x. Let Y be the set 
of all these witnesses y_x, for x ∈ G ∩ X_3; this set is an antichain with respect to the 
descendant relation. For each y ∈ Y, let X_y be the chain of nodes x which are witnessed by y. 
By the lexicographic assumption, if y, y' ∈ Y are such that y is lexicographically before y', 
then no element of X_y has an ancestor in X_{y'}. Consequently, if we define π_y to be the guide 
that connects all of X_y to y, then Π = {π_y}_{y ∈ Y} is a one-color guidance system for all nodes 
in G ∩ X_3. □ 

We call the guides of Π starting guides, as usual. 

We now need to combine Π and Π_𝒢. If we simply take the two systems together, we might 
end up with a starting guide going through an external edge of G that is already traversed by 
two transit guides. To avoid this problem, we need an optimisation relying on a simple 
observation formulated in the claim below. 

A descendant guide π is called live in a subfactor G if π passes through G, and its 
target is not in a child subfactor of G (i.e., the target is in a proper descendant of some 
child of G, or outside F). The idea is that the target of π satisfies the assumption m_1 ≥ 2 
from the 'point of view' of nodes in G. Note that guides live in G may be either transit or 
starting in G. 

Claim 4. Suppose that two consistent descendant guides π and π' are live in a subfactor 
G and exit G through the same edge. Suppose also that at least one of them is starting in 
G. Then all source nodes of one of π, π' can be moved to the other. 

Proof. Let (y_1, y_2) be the witness pair corresponding to π and let (y'_1, y'_2) be the witness 
pair corresponding to π'; by consistency, not only the second witnesses y_2 and y'_2 are 
determined by π and π', but the whole witness pairs. Note that the source nodes of a 
descendant guide in G are all situated on one path from the root of G to one of the leaves. 
If both π and π' are starting in G, assume without loss of generality that π has a source 
node that is an ancestor of all source nodes of π'; otherwise exactly one of the guides is 
starting in G, and we assume without loss of generality that it is π'. As only the case 
m_1, m_2 ≥ 2 is considered, and the values s_1, …, s_6 are fixed, equation (5.4)(i) implies that 
the pair (y_1, y_2) is a witness for all source nodes of π' as well. Thus, these nodes may be 
guided to y_2 instead of y'_2. □ 

We use the term live transit guide for the transit guides that are live in G, and dead 
transit guide for the other transit guides (those that have their target in a child subfactor 
of G). 

Consider an edge e that connects G with a child subfactor H. Suppose first that e 
is traversed by a starting guide and a live transit guide. Using the claim we merge the 
starting guide with the transit one. Therefore, we end up satisfying the invariant property: 
e is passed by at most one live guide, and possibly by one dead transit guide. 

We have thus completed the proof of the Main Lemma, and therefore also of Theorem 3. 

6. Applications 

In this section, we present two applications of our results. The first application is a class 
of XML documents for which emptiness of XPath is decidable. The second application is 
a proof that two-variable first-order logic is not captured by XPath, in the presence of two 
attribute values per node. 

Satisfiability of XPath. As we said in the introduction, our study on class automata is 
a first step in a search for structural restrictions on data words and data trees which make 
XPath satisfiability decidable. 

One idea for a structural restriction would be a variant of bounded clique width, or 
tree width. Maybe bounded clique or tree width are interesting restrictions, but they are 
not relevant in the study of class automata. This is because bounded clique width or tree 
width, when defined in the natural way for data trees, guarantees decidable satisfiability for 
a logic far more powerful than class automata: MSO with navigation and equal data value 
predicates. 

Here we provide a basic example of a restriction on inputs that works for class automata 
but not for MSO. A data tree is called bipartite (bipartite refers to the data) if its nodes 
can be split into two connected (by the child relation) sets X, Y such that every class has 
at most one node in X and at most one node in Y. 
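The definition can be made concrete with a small check. The sketch below assumes a tree given by a hypothetical `parent` array together with a `data` array of data values, and assumes that both parts X and Y are nonempty, so that they are separated by a single edge. It tries every edge as the cut and is meant as an illustration, not as an efficient algorithm.

```python
from collections import Counter

def subtree_nodes(children, root):
    """Return the set of nodes in the subtree rooted at `root`."""
    seen, stack = set(), [root]
    while stack:
        v = stack.pop()
        seen.add(v)
        stack.extend(children[v])
    return seen

def is_bipartite(parent, data):
    """Check whether some edge splits the tree into connected parts X, Y
    such that every data value occurs at most once in X and at most once in Y.

    parent[v] is the parent of node v (None for the root);
    data[v] is the data value of node v.
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    for v, p in enumerate(parent):
        if p is not None:
            children[p].append(v)

    def injective(nodes):
        counts = Counter(data[v] for v in nodes)
        return all(c <= 1 for c in counts.values())

    all_nodes = set(range(n))
    for v, p in enumerate(parent):
        if p is None:
            continue
        lower = subtree_nodes(children, v)  # part below the cut edge (p, v)
        upper = all_nodes - lower           # the rest, still connected
        if injective(lower) and injective(upper):
            return True
    return False

# A data word is a tree in which every node has at most one child.
print(is_bipartite([None, 0, 1, 2], [1, 2, 1, 2]))
print(is_bipartite([None, 0, 1], [1, 1, 1]))
```

In the first word the cut between the second and third position works; in the second word every cut leaves two equal data values on one side.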

Satisfiability of MSO, or even FO, with navigation and data equality predicates is 
undecidable even for bipartite data words. For instance, a solution to the Post Correspondence 
Problem can be encoded in a bipartite data word using an FO formula. 

This coding, however, cannot be captured by class automata, which is implied by the 
following theorem. 

Theorem 5. On bipartite data trees, emptiness is decidable for class automata, and 
therefore also for XPath. 

Proof. The key insight is that data trees which use each data value at most twice can be 
described using semilinear sets. To avoid notational complications, assume that every data 
value appears exactly twice in a bipartite tree. 

Consider a class automaton A over input alphabet Σ. Suppose that the work alphabet 
is Γ, and the transducer is f. Let the class condition be a language over alphabet Γ × {0, 1}, 
recognized by a deterministic bottom-up tree automaton C, with states Q. For technical 
convenience, assume that a run of C labels tree edges, instead of tree nodes, with states. 
Thus all leaf dummy edges of an input tree u over Γ × {0, 1} are labeled with the initial 
state of C, and the labeling of all other edges is uniquely determined by u. Let C(u) denote 
the state that labels the root dummy edge of u. A tree u is accepted if C(u) is an accepting 
state. 

By assumption, the nodes of a bipartite data tree (t, ~) are partitioned into two 
connected subsets, and thus there is a single edge that splits the two subsets; call this edge 
the border edge. Such a tree t with a distinguished edge may be modeled as t ⊗ {z, z'}, where z 
and z' are the two nodes connected by the border edge. Further, t may be split into two 
smaller trees, according to the partition: the border edge is a dummy root edge for one of 
the trees, and a leaf dummy edge for the other one. Call these two induced trees the lower and 
upper tree, respectively. 

For the lower tree of t, call it t_l, and a subset X of nodes of t_l, it makes sense to write 
C(t_l ⊗ X). For the upper one, call it t_u, we will need a slightly different notation. Assume 
that the automaton C reads t_u ⊗ X, for some X, starting in the initial state in all dummy leaf 
edges of t_u except for the border edge, where the automaton starts in some chosen state q. 
If C accepts the tree t_u ⊗ X under this assumption, we write q ∈ C^{-1}(t_u ⊗ X). In particular, 
observe that the automaton C accepts t ⊗ X if and only if C(t_l ⊗ X_l) ∈ C^{-1}(t_u ⊗ X_u), where 
X = X_l ∪ X_u is the induced partition of X. 

Given a tree t over Γ with a single distinguished edge, say t ⊗ {z, z'}, let π(t ⊗ {z, z'}) 
be the set of all trees s over Q × {l, u} that satisfy the following conditions: 

• the set of nodes of s is the same as the set of nodes of t, 

• if a node x of s is in t_l then it is labeled by (C(t_l ⊗ {x}), l), 

• if a node x of s is in t_u then it is labeled by (q, u), for some q ∈ C^{-1}(t_u ⊗ {x}). 

Using π, we define the relation σ between trees over Γ and trees over Q × {l, u} as follows: 
(t, s) ∈ σ if s ∈ π(t ⊗ {z, z'}) for some edge (z, z') in t. 
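The criterion that the state computed on the lower tree must belong to the set of states from which the upper tree is accepted can be illustrated in the word case. The sketch below is an assumption-laden simplification: a word stands for a unary tree read from the leaf to the root, the lower tree is a prefix, the upper tree is the remaining suffix, and the example automaton (parity of the number of 1s) is hypothetical.

```python
from itertools import product

def run(delta, state, word):
    """Run a deterministic automaton from `state` on `word`."""
    for letter in word:
        state = delta[state, letter]
    return state

def inverse_states(delta, states, accepting, upper):
    """States q such that reading `upper` from q ends in an accepting state.

    In the word case this plays the role of the set written C^{-1}(...).
    """
    return {q for q in states if run(delta, q, upper) in accepting}

# Hypothetical example automaton: accepts words with an even number of 1s.
states = {"even", "odd"}
initial = "even"
accepting = {"even"}
delta = {
    ("even", 0): "even", ("even", 1): "odd",
    ("odd", 0): "odd", ("odd", 1): "even",
}

def accepts(word):
    """Ordinary acceptance: run on the whole word."""
    return run(delta, initial, word) in accepting

def accepts_split(word, cut):
    """Acceptance computed separately on the two sides of the border."""
    lower, upper = word[:cut], word[cut:]
    lower_state = run(delta, initial, lower)
    return lower_state in inverse_states(delta, states, accepting, upper)

# The two notions of acceptance agree for every word and every cut.
for length in range(5):
    for w in product((0, 1), repeat=length):
        for cut in range(length + 1):
            assert accepts_split(list(w), cut) == accepts(list(w))
```

The point of the split is that each side can be summarised by a state (or a set of states) of the automaton, which is what the labels of the trees s record.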

Claim 5. The relation σ is computable by a nondeterministic letter-to-letter transducer. 

Let T_Σ denote the set of all trees over Σ. As nondeterministic transducers preserve 
regular languages, we have: 

Claim 6. Both K = f(T_Σ) and σ(K) = {s : (t, s) ∈ σ, t ∈ K} are effectively computable 
regular languages over Γ and Q × {l, u}, respectively. 

Claim 7. The class automaton A accepts some bipartite data tree if and only if σ(K) contains 
a tree that satisfies, for every q ∈ Q, the following condition: the number of nodes labeled 
by (q, l) is the same as the number of nodes labeled by (q, u). 

With the last two observations, decidability follows immediately. To decide emptiness of 
a given class automaton A, compute the Parikh image of the language σ(K), an effectively 
semi-linear set by Claim 6; intersect this set with the semi-linear condition of Claim 7; and 
check for emptiness of the resulting semi-linear set. □ 
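The counting condition of Claim 7 depends only on how often each label occurs, which is why the Parikh image suffices. As a minimal sketch, the condition can be checked on a single candidate tree given as a list of node labels; the function name `balanced` and the string encoding of the two sides are assumptions made for illustration. This checks one tree only and is not the semi-linear emptiness procedure itself.

```python
from collections import Counter

def balanced(labels):
    """Check the condition of Claim 7 on the multiset of node labels.

    labels: iterable of pairs (q, side), where side is "l" or "u".
    Returns True iff, for every state q, the number of nodes labeled
    (q, "l") equals the number of nodes labeled (q, "u").
    """
    counts = Counter(labels)
    states = {q for q, _ in counts}
    return all(counts[q, "l"] == counts[q, "u"] for q in states)

print(balanced([("p", "l"), ("r", "u"), ("p", "u"), ("r", "l")]))
print(balanced([("p", "l"), ("p", "l"), ("p", "u")]))
```

Since the order of the labels is irrelevant, the condition is a system of linear equalities over the Parikh vector of the tree.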

Multiple attributes. Heretofore, we have studied data trees, which model XML 
documents where each node has one data value. In this section, and this one only, we consider 
the situation where each node x has n data values. Formally, an n-data tree consists of a 
tree t over the finite alphabet and functions d_1, …, d_n which map the tree nodes to data 
values. How does XPath deal with multiple data values? Instead of y_1 ~ y_2 and y_1 ≁ y_2, 
we can use any formula of the form 

\[ d_i(y_1) = d_j(y_2) \qquad \text{where } i, j \in \{1, \ldots, n\} \]

or its negation (for inequality). For n > 1 we need more information than just the partitions 
of nodes into classes of ~_i, for each i ∈ {1, …, n}. An example is the property "every node 
has the same data value on attributes 1 and 2". 
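A two-position example shows why the partitions alone are not enough. The sketch below uses two hypothetical 2-data words that induce exactly the same partitions for both attributes, yet only one of them has equal values on attributes 1 and 2 at every node.

```python
def classes(values):
    """Partition of the positions induced by equal data values."""
    groups = {}
    for pos, v in enumerate(values):
        groups.setdefault(v, set()).add(pos)
    return {frozenset(g) for g in groups.values()}

def same_on_both_attributes(d1, d2):
    """Property: every node has the same data value on attributes 1 and 2."""
    return all(a == b for a, b in zip(d1, d2))

# Two hypothetical 2-data words over positions 0 and 1.
doc_a = ([5, 6], [5, 6])
doc_b = ([5, 6], [6, 5])

# Both attributes induce the same partitions in the two documents ...
assert classes(doc_a[0]) == classes(doc_b[0])
assert classes(doc_a[1]) == classes(doc_b[1])
# ... but the documents disagree on the property.
assert same_on_both_attributes(*doc_a)
assert not same_on_both_attributes(*doc_b)
```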

How do we extend class automata to read n-data trees? For one data value, the class 
condition is a language over the alphabet Γ × {0, 1}. For n data values, the class condition 
is a language over the alphabet Γ × {0, 1}^n. An n-data tree (t, d_1, …, d_n) is accepted if 
there is an output s of the transducer on t such that for every data value d, the tree 

\[ s \otimes d_1^{-1}(d) \otimes \cdots \otimes d_n^{-1}(d) \]

is accepted by the class condition. By the same technique as in the proof of Theorem [TJ we 
can prove that the automata capture XPath. 
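For data words, the tree fed to the class condition can be written out explicitly. The following sketch assumes that the transducer output `s` is already given, that the class condition is available as a Python predicate (a hypothetical stand-in for a regular language), and that only data values actually occurring in the input need to be checked.

```python
def class_string(s, attrs, d):
    """Return the word s (x) d_1^{-1}(d) (x) ... (x) d_n^{-1}(d) as a list.

    s is the transducer output, one work-alphabet letter per position;
    attrs is a list of n lists, attrs[k][p] being the (k+1)-th data value
    of position p. The letter at position p is (s[p], b_1, ..., b_n),
    where b_k = 1 iff the k-th attribute of position p equals d.
    """
    return [
        (s[p],) + tuple(int(attr[p] == d) for attr in attrs)
        for p in range(len(s))
    ]

def accepted(s, attrs, class_condition):
    """Check the class condition for every data value occurring in the input."""
    values = {v for attr in attrs for v in attr}
    return all(class_condition(class_string(s, attrs, d)) for d in values)

# Two positions, two attributes; the value 2 occurs at position 1 in
# attribute 1 and at both positions in attribute 2.
print(class_string("ab", [[1, 2], [2, 2]], 2))
```

For n = 1 this reduces to the usual marking of a single class by a bit in {0, 1}.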

A consequence is that for n ≥ 2, XPath does not capture two-variable first-order logic 
(unlike the case of n = 1). This was an open question. 

Theorem 6. The following (two-variable) property 

\[ \varphi \;=\; \forall x\, \forall y \quad d_1(x) = d_1(y) \iff d_2(x) = d_2(y) \]

cannot be defined by a boolean query of XPath. 

Proof. Towards a contradiction, suppose that φ is recognized by some class automaton in the 
generalized version for 2 data values. The document that will confuse the automaton will 
be a word. Consider the document (over a one-letter alphabet) with positions 1, …, 2n, 
where the data values are defined by 

\[ d_1(i) = d_1(n + i) = i \qquad \text{for } i \in \{1, \ldots, n\} \]

\[ d_2(i) = d_2(n + i) = -i \qquad \text{for } i \in \{1, \ldots, n\}. \]

Since the above document satisfies φ, it should be accepted by the automaton. Let the 
work alphabet of the automaton be Γ, and let a_1 ⋯ a_{2n} ∈ Γ* be the word produced by the 
automaton in the accepting run. For a data value d, we use the term class string of d for 
the word 

\[ a_1 \cdots a_{2n} \otimes d_1^{-1}(d) \otimes d_2^{-1}(d). \]

By definition of the way class automata accept documents, each class string should belong 
to the class condition. Consider a number i ∈ {1, …, n}. The class string of i is 

\[ u_i = a_1 \cdots a_{2n} \otimes \{i, n + i\} \otimes \emptyset. \]

Suppose that 

\[ \alpha : (\Gamma \times \{0,1\} \times \{0,1\})^* \to S \]

is a morphism recognizing the class condition. If n is greater than |S|², then we can find 
two data values i < j in {1, …, n} such that 

\[ \alpha(u_i|\{1, \ldots, n\}) = \alpha(u_j|\{1, \ldots, n\}) \]

\[ \alpha(u_i|\{n + 1, \ldots, 2n\}) = \alpha(u_j|\{n + 1, \ldots, 2n\}). \]

Consider a new document obtained from the previous one by swapping the first, but not the 
second, data value in the positions i and j. This new document violates the property φ. 
To get the contradiction, we will construct an accepting run of the class automaton for this 
new document. The output of the transducer is the same word a_1 ⋯ a_{2n}. The class strings 
are the same for data values other than i and j, so they are also accepted. The new class 
string of i is a_1 ⋯ a_{2n} ⊗ {j, n + i} ⊗ ∅, which agrees with u_j on positions 1, …, n and 
with u_i on positions n + 1, …, 2n, and symmetrically for j. Hence, for the class strings 
of i and j, the images under α are the same as before by the assumption on i and j, and 
they are also accepted. □ 
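The two documents used in the proof are easy to build and compare. The sketch below represents each attribute as a dictionary from positions 1, …, 2n to data values; the function names are hypothetical. It checks that the original document satisfies the property of Theorem 6, that the swapped document violates it, and that the class of the value i in the first attribute changes from {i, n + i} to {j, n + i}.

```python
def build_document(n):
    """Positions 1..2n with d1(i) = d1(n+i) = i and d2(i) = d2(n+i) = -i."""
    d1 = {p: (p if p <= n else p - n) for p in range(1, 2 * n + 1)}
    d2 = {p: -d1[p] for p in d1}
    return d1, d2

def satisfies_phi(d1, d2):
    """The property: for all x, y: d1(x) = d1(y) iff d2(x) = d2(y)."""
    return all(
        (d1[x] == d1[y]) == (d2[x] == d2[y])
        for x in d1 for y in d1
    )

def swap_first_value(d1, i, j):
    """Swap the first data value at positions i and j; d2 is left untouched."""
    swapped = dict(d1)
    swapped[i], swapped[j] = d1[j], d1[i]
    return swapped

def class_of(d, value):
    """Set of positions carrying `value` under the attribute d."""
    return {p for p, v in d.items() if v == value}

n, i, j = 3, 1, 2
d1, d2 = build_document(n)
assert satisfies_phi(d1, d2)
assert class_of(d1, i) == {i, n + i}

new_d1 = swap_first_value(d1, i, j)
assert not satisfies_phi(new_d1, d2)
assert class_of(new_d1, i) == {j, n + i}
```

The violation is witnessed by the positions i and n + i, which still share their second data value but no longer share the first.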